[jira] [Commented] (HDFS-4115) TestHDFSCLI.testAll fails one test due to number format

2016-02-02 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128540#comment-15128540
 ] 

Tony Wu commented on HDFS-4115:
---

Hi [~sureshms], [~jingzhao] & [~scurrilous],

I ran into this error when running branch-2 TestHDFSCLI. To my surprise this 
patch was not ported to branch-2. Could you take a quick look and consider 
backporting the patch?

Thanks,
Tony

> TestHDFSCLI.testAll fails one test due to number format
> ---
>
> Key: HDFS-4115
> URL: https://issues.apache.org/jira/browse/HDFS-4115
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Affects Versions: 2.0.0-alpha
> Environment: Apache Maven 3.0.4
> Maven home: /usr/share/maven
> Java version: 1.6.0_35, vendor: Sun Microsystems Inc.
> Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre
> Default locale: en_US, platform encoding: ISO-8859-1
> OS name: "linux", version: "3.2.0-32-generic", arch: "amd64", family: "unix"
> Reporter: Trevor Robinson
> Assignee: Trevor Robinson
> Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HDFS-4115.patch
>
>
> This test fails repeatedly on only one of my machines:
> {noformat}
> Failed tests:   testAll(org.apache.hadoop.cli.TestHDFSCLI): One of the tests 
> failed. See the Detailed results to identify the command that failed
>Test ID: [587]
>   Test Description: [report: Displays the report about the Datanodes]
>  Test Commands: [-fs hdfs://localhost:35254 -report]
> Comparator: [RegexpComparator]
> Comparision result:   [fail]
>Expected output:   [Configured Capacity: [0-9]+ \([0-9]+\.[0-9]+ 
> [BKMGT]+\)]
>  Actual output:   [Configured Capacity: 472446337024 (440 GB)
> {noformat}
> The problem appears to be that {{StringUtils.byteDesc}} calls 
> {{limitDecimalTo2}} which calls {{DecimalFormat.format}} with a pattern of 
> {{#.##}}. This pattern does not include trailing zeroes, so the expected 
> regex is incorrect in requiring a decimal.
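The mismatch described above can be reproduced outside Hadoop. The following is a minimal sketch, not the Hadoop source: the class {{ByteDescRegexDemo}}, the helper {{capacityLine}} and the relaxed pattern are assumptions made for this illustration, and the attached patch may correct the expected regex differently. It formats two capacities with a {{DecimalFormat}} pattern of {{#.##}} and checks each line against the strict regex from the test and against a variant whose fractional part is optional.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;
import java.util.regex.Pattern;

final class ByteDescRegexDemo {
    // Expected-output regex from the failing test: the fraction is mandatory.
    static final Pattern STRICT = Pattern.compile(
        "Configured Capacity: [0-9]+ \\([0-9]+\\.[0-9]+ [BKMGT]+\\)");

    // Hypothetical relaxed variant: the fraction is optional.
    static final Pattern RELAXED = Pattern.compile(
        "Configured Capacity: [0-9]+ \\([0-9]+(\\.[0-9]+)? [BKMGT]+\\)");

    // In the pattern "#.##" every '#' is an optional digit, so trailing
    // zeros and a dangling decimal point are not printed.
    static String limitDecimalTo2(double value) {
        DecimalFormat format =
            new DecimalFormat("#.##", DecimalFormatSymbols.getInstance(Locale.US));
        return format.format(value);
    }

    // Simplified stand-in for the report line; it always scales to GB.
    static String capacityLine(long bytes) {
        double gigabytes = bytes / (double) (1L << 30);
        return "Configured Capacity: " + bytes
            + " (" + limitDecimalTo2(gigabytes) + " GB)";
    }

    public static void main(String[] args) {
        for (long bytes : new long[] {472446337024L, 998093619200L}) {
            String line = capacityLine(bytes);
            System.out.println(line
                + " | strict=" + STRICT.matcher(line).matches()
                + " relaxed=" + RELAXED.matcher(line).matches());
        }
    }
}
```

A capacity that rounds to a whole number of GB fails only the strict pattern, which would explain why the test fails on a single machine: the outcome depends on the disk size.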



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2016-01-11 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092146#comment-15092146
 ] 

Tony Wu commented on HDFS-9493:
---

Thanks [~liuml07] & [~eddyxu] for the review and comments!

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Fix For: 2.8.0
>
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch, 
> HDFS-9493.003.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2016-01-04 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081284#comment-15081284
 ] 

Tony Wu commented on HDFS-9493:
---

Hi [~eddyxu], 

Could you take a look at the latest patch (v3) and let me know if you have 
comments? 

Thanks,
Tony


> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch, 
> HDFS-9493.003.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-29 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074430#comment-15074430
 ] 

Tony Wu commented on HDFS-9493:
---

The test failures are not related to this patch, as this patch only affects 
TestMetaSave, which passes.

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch, 
> HDFS-9493.003.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Updated] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-18 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9493:
--
Attachment: HDFS-9493.003.patch

In v3 patch:
* Addressed [~eddyxu]'s review comments:
** Dropped grabbing read lock for {{FSNamesystem}}
** Renamed helper function in test to {{stopDatanodeAndWait()}}
** Reduced {{dfs.namenode.stale.datanode.interval}} to 5 seconds

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch, 
> HDFS-9493.003.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-18 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065039#comment-15065039
 ] 

Tony Wu commented on HDFS-9493:
---

Hi [~eddyxu],

Thanks a lot for reviewing my patch. I have incorporated your comments in the 
next patch. One note on {{dfs.namenode.stale.datanode.interval}}: I think 
reducing its value may not be necessary, as 
{{BlockManagerTestUtil#noticeDeadDatanode()}} would make the NN "notice" the DN 
is dead immediately. I updated the parameter to be 5 seconds in case I'm 
missing something.

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-14 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057224#comment-15057224
 ] 

Tony Wu commented on HDFS-9493:
---

The failed tests are not related to the patch, as the patch only updated 
TestMetaSave.java with a new helper function that no other tests use.

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-14 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056768#comment-15056768
 ] 

Tony Wu commented on HDFS-9493:
---

Hi [~liuml07],

Thanks for your detailed comments. I have confirmed problems 2 & 3 after 
revisiting the test code.

For problem 3, I misread the {{@BeforeClass}} and {{@AfterClass}} tags, 
thinking {{MiniDFSCluster}} would be torn down and rebuilt for every test. That 
is not the case here. Instead, as you said, by the time {{testMetaSave()}} runs, 
{{testMetaSaveAfterDelete()}} has already removed the DN, and the 
{{cluster.stopDataNode()}} call in the second test is essentially a no-op. 
It looks like the tests just happened to be working. The {{setup}} and 
{{tearDown}} functions should have been executed before and after each test 
case.
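
The difference between a class-level fixture and a per-test fixture can be sketched without JUnit or Hadoop. In the sketch below, {{FakeCluster}} is a hypothetical stand-in for {{MiniDFSCluster}}, and its {{stopDataNode()}} returning false when nothing is left to stop is an assumption of this example. It only mimics the no-op behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

final class FixtureScopeDemo {
    /** Hypothetical stand-in for a cluster that starts with one datanode. */
    static final class FakeCluster {
        private int liveDataNodes = 1;

        /** Stops one datanode; returns false (a no-op) if none is left. */
        boolean stopDataNode() {
            if (liveDataNodes == 0) {
                return false;
            }
            liveDataNodes--;
            return true;
        }
    }

    /** Class-level fixture: one cluster is shared by both "tests". */
    static List<Boolean> runWithSharedCluster() {
        FakeCluster shared = new FakeCluster();
        List<Boolean> stopped = new ArrayList<>();
        stopped.add(shared.stopDataNode()); // first test stops the datanode
        stopped.add(shared.stopDataNode()); // second test: nothing left to stop
        return stopped;
    }

    /** Per-test fixture: every "test" builds its own cluster. */
    static List<Boolean> runWithFreshClusters() {
        List<Boolean> stopped = new ArrayList<>();
        stopped.add(new FakeCluster().stopDataNode());
        stopped.add(new FakeCluster().stopDataNode());
        return stopped;
    }

    public static void main(String[] args) {
        System.out.println("shared cluster: " + runWithSharedCluster());
        System.out.println("fresh clusters: " + runWithFreshClusters());
    }
}
```

With the shared fixture, the second call has no effect, so a test that depends on it can pass or fail depending on execution order.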

For problems 1 and 2, to further reduce the wait time for this unit test, 
{{BlockManagerTestUtil#noticeDeadDatanode()}} will be useful. This helper 
function marks the DN as dead right away instead of waiting for the HB 
timeout.

I will rework the current patch and incorporate the findings above.

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Updated] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-14 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9493:
--
Attachment: HDFS-9493.002.patch

In v2 patch:
* Use {{BlockManagerTestUtil#noticeDeadDatanode()}} to reduce the test run 
time. After this call the NN will declare the DN dead right away, rather than 
waiting for the HB timeout.
* Create a helper function {{stopDnAndWaitForNnToRemoveIt()}} for test cases to 
stop a DN and wait for it to be removed by the NN.
* Create a new {{MiniDFSCluster}} for every test case.

Verified that {{testMetaSave}} runs fine on OSX & Linux (CentOS), and that the 
run time is reduced from 30+ seconds to 17 seconds.

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch, HDFS-9493.002.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Assigned] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-11 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu reassigned HDFS-9493:
-

Assignee: Tony Wu

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Commented] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-11 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15053953#comment-15053953
 ] 

Tony Wu commented on HDFS-9493:
---

Hi [~liuml07], I would like to work on fixing this test.

I did some analysis on the failure by printing out the metasave content. It 
turns out the metasave output for the current test contains 2 Datanodes:
{code}
metasave out: 1 files and directories, 0 blocks = 1 total filesystem objects
metasave out: Live Datanodes: 1
metasave out: Dead Datanodes: 1
metasave out: Metasave: Blocks waiting for replication: 0
metasave out: Mis-replicated blocks that have been postponed:
metasave out: Metasave: Blocks being replicated: 0
metasave out: Metasave: Blocks 4 waiting deletion from 2 datanodes.
metasave out: 127.0.0.1:53465
metasave out: LightWeightHashSet(size=2, modification=2, entries.length=16)
metasave out: 127.0.0.1:53469
metasave out: LightWeightHashSet(size=2, modification=2, entries.length=16)
metasave out: Metasave: Number of datanodes: 2
metasave out: 127.0.0.1:53465 IN 998093619200(929.55 GB) 10270(10.03 KB) 0.00% 
882663514112(822.04 GB) 0(0 B) 0(0 B) 100.00% 0(0 B) Fri Dec 11 17:48:41 PST 
2015
metasave out: 127.0.0.1:53469 IN 998093619200(929.55 GB) 8192(8 KB) 0.00% 
882663825408(822.04 GB) 0(0 B) 0(0 B) 100.00% 0(0 B) Fri Dec 11 17:48:26 PST 
2015
{code}

This leads me to believe the following wait time was not long enough: 
{code:java}
// wait for namenode to discover that a datanode is dead
Thread.sleep(15000);
{code}

After increasing the sleep time to 30 seconds, the test was able to pass 
consistently.

The invalid block count shown in the {{Block x waiting deletion...}} statement 
is updated by {{blockManager.removeBlocksAssociatedTo()}}, which is called by 
{{DatanodeManager#removeDeadDatanode()}}. This only happens at 
{{HeartbeatManager#heartbeatCheck()}}. Using sleep may not be the best way to 
ensure the Datanode is removed by the Namenode.

I will upload a patch with a more robust way of waiting for the Datanode to be 
removed, instead of relying on {{Thread.sleep()}}.
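
One common alternative to a fixed sleep is to poll a condition until it holds or a deadline passes. The helper below is a self-contained sketch of that idea only. The names {{PollingWaitDemo}} and {{waitFor}}, the parameters, and the simulated "datanode removed" flag are assumptions for illustration and are not taken from the patch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

final class PollingWaitDemo {
    /**
     * Polls the condition every checkEveryMillis. Returns true as soon as the
     * condition holds, or false once waitForMillis has elapsed (or the
     * waiting thread is interrupted).
     */
    static boolean waitFor(BooleanSupplier condition,
                           long checkEveryMillis, long waitForMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitForMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(checkEveryMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Simulated state change: a background thread flips the flag after
        // 200 ms, standing in for the namenode removing a dead datanode.
        AtomicBoolean datanodeRemoved = new AtomicBoolean(false);
        Thread remover = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            datanodeRemoved.set(true);
        });
        remover.start();

        boolean removed = waitFor(datanodeRemoved::get, 50, 5_000);
        System.out.println("datanode removed in time: " + removed);
    }
}
```

Compared with a fixed {{Thread.sleep()}}, the wait ends as soon as the condition is observed, and a slow machine gets the full timeout instead of failing early.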

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Updated] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-11 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9493:
--
Attachment: HDFS-9493.001.patch

In v1 patch:
* Add a helper function {{BlockManagerTestUtil.isDatanodeRemoved()}} to check 
if DN is removed by NN.
* Update TestMetaSave#testMetasaveAfterDelete() to use the helper function.

Verified on OSX and Linux (CentOS 6) the test passes consistently.

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Updated] (HDFS-9493) Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk

2015-12-11 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9493:
--
Status: Patch Available  (was: Open)

> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> ---
>
> Key: HDFS-9493
> URL: https://issues.apache.org/jira/browse/HDFS-9493
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Mingliang Liu
> Assignee: Tony Wu
> Attachments: HDFS-9493.001.patch
>
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> ---
>  T E S T S
> ---
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec 
> <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave)  
> Time elapsed: 15.318 sec  <<< FAILURE!
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}





[jira] [Commented] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-04 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042448#comment-15042448
 ] 

Tony Wu commented on HDFS-9491:
---

Hi [~eddyxu],

I just ran {{TestDecommission}} with the patch on JDK 8 on OSX, and the test 
passes without error.

The failed test {{TestDecommission}} is not related to the patch. Please see 
the error log from Jenkins below:
{code}
Tests run: 19, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 160.801 sec 
<<< FAILURE! - in org.apache.hadoop.hdfs.TestDecommission
testDecommissionWithOpenfile(org.apache.hadoop.hdfs.TestDecommission)  Time 
elapsed: 1.753 sec  <<< ERROR!
java.lang.RuntimeException: Error while running command to get file permissions 
: ExitCodeException exitCode=127: /bin/ls: error while loading shared 
libraries: libc.so.6: failed to map segment from shared object: Permission 
denied

at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
at org.apache.hadoop.util.Shell.run(Shell.java:838)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1211)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1193)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1081)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:702)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:677)
at 
org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:155)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:172)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2459)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2501)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2484)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2376)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1592)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:844)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:482)
at 
org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:441)
at 
org.apache.hadoop.hdfs.TestDecommission.startCluster(TestDecommission.java:334)
at 
org.apache.hadoop.hdfs.TestDecommission.testDecommissionWithOpenfile(TestDecommission.java:830)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)

at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:742)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:677)
at 
org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:155)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:172)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2459)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2501)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2484)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2376)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1592)
at 
org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:844)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:482)
at 
org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:441)
at 
org.apache.hadoop.hdfs.TestDecommission.startCluster(TestDecommission.java:334)
at 
org.apache.hadoop.hdfs.TestDecommission.testDecommissionWithOpenfile(TestDecommission.java:830)
{code}

The test case in {{TestDecommission}} that 

[jira] [Commented] (HDFS-9490) MiniDFSCluster should change block generation stamp via FsDatasetTestUtils

2015-12-04 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041906#comment-15041906
 ] 

Tony Wu commented on HDFS-9490:
---

Thanks a lot, [~eddyxu], for reviewing and providing comments.

> MiniDFSCluster should change block generation stamp via FsDatasetTestUtils
> --
>
> Key: HDFS-9490
> URL: https://issues.apache.org/jira/browse/HDFS-9490
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: test
> Affects Versions: 2.7.1
> Reporter: Tony Wu
> Assignee: Tony Wu
> Priority: Minor
> Attachments: HDFS-9490.001.patch, HDFS-9490.002.patch
>
>
> {{MiniDFSCluster#changeGenStampOfBlock}} directly manipulates the block meta 
> file to update the generation stamp. This depends on a file-based {{FsDataset}}.
> We can abstract the change generation stamp operation in 
> {{FsDatasetTestUtils}}.





[jira] [Updated] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-04 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9491:
--
Attachment: HDFS-9491.002.patch

In v2 patch:
* Rebased to latest trunk.
* Verified the build completes on OSX & Linux.
* Verified the following tests (they all use the modified API) pass on OSX & 
Linux: TestDecommission, TestRBWBlockInvalidation, TestDNFencing, 
TestHASafeMode, TestLazyWriter.

> Tests should get the number of pending async delets via FsDatasetTestUtils
> --
>
> Key: HDFS-9491
> URL: https://issues.apache.org/jira/browse/HDFS-9491
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: test
> Affects Versions: 2.7.1
> Reporter: Tony Wu
> Assignee: Tony Wu
> Priority: Minor
> Attachments: HDFS-9491.001.patch, HDFS-9491.002.patch
>
>
> A few unit tests use {{DataNodeTestUtils#getPendingAsyncDeletions}} to 
> retrieve the number of pending async deletions. It internally calls 
> {{FsDatasetTestUtil#getPendingAsyncDeletions}}:
> {code:java}
> public static long getPendingAsyncDeletions(FsDatasetSpi fsd) {
>   return ((FsDatasetImpl)fsd).asyncDiskService.countPendingDeletions();
> }
> {code}
> This assumes {{FsDatasetImpl}} is (the only implementation of) {{FsDataset}}. 
> However {{FsDataset}} is pluggable and can have other implementations.
> We can abstract getting the number of async deletions in 
> {{FsDatasetTestUtils}}.
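
The coupling described above, and one way to remove it, can be sketched in isolation. All types below ({{Dataset}}, {{FileDataset}}, {{MemoryDataset}}, {{DatasetTestUtils}} and its two implementations) are hypothetical stand-ins invented for this example. They are not the Hadoop classes and do not reflect the actual {{FsDatasetTestUtils}} API.

```java
import java.util.Arrays;
import java.util.List;

final class PendingDeletionsDemo {
    /** Hypothetical stand-in for the pluggable dataset interface. */
    interface Dataset { }

    /** File-based implementation that tracks deletions in a counter. */
    static final class FileDataset implements Dataset {
        final long pendingDeletions;
        FileDataset(long pendingDeletions) { this.pendingDeletions = pendingDeletions; }
    }

    /** A second implementation with a different internal representation. */
    static final class MemoryDataset implements Dataset {
        final List<String> deleteQueue;
        MemoryDataset(List<String> deleteQueue) { this.deleteQueue = deleteQueue; }
    }

    /** Cast-based helper: only works when the dataset is a FileDataset. */
    static long pendingViaCast(Dataset dataset) {
        return ((FileDataset) dataset).pendingDeletions;
    }

    /** Abstraction used by tests; each dataset type supplies its own version. */
    interface DatasetTestUtils {
        long getPendingAsyncDeletions();
    }

    static final class FileDatasetTestUtils implements DatasetTestUtils {
        private final FileDataset dataset;
        FileDatasetTestUtils(FileDataset dataset) { this.dataset = dataset; }
        @Override public long getPendingAsyncDeletions() { return dataset.pendingDeletions; }
    }

    static final class MemoryDatasetTestUtils implements DatasetTestUtils {
        private final MemoryDataset dataset;
        MemoryDatasetTestUtils(MemoryDataset dataset) { this.dataset = dataset; }
        @Override public long getPendingAsyncDeletions() { return dataset.deleteQueue.size(); }
    }

    public static void main(String[] args) {
        MemoryDataset memory = new MemoryDataset(Arrays.asList("blk_1", "blk_2"));
        DatasetTestUtils[] all = {
            new FileDatasetTestUtils(new FileDataset(3)),
            new MemoryDatasetTestUtils(memory)
        };
        // Test code depends only on the abstraction, not on a concrete dataset.
        for (DatasetTestUtils utils : all) {
            System.out.println("pending deletions: " + utils.getPendingAsyncDeletions());
        }
        try {
            pendingViaCast(memory);
        } catch (ClassCastException e) {
            System.out.println("cast-based helper fails for MemoryDataset");
        }
    }
}
```

Tests written against the interface keep working when a different dataset implementation is plugged in, whereas the cast-based helper fails at run time.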





[jira] [Commented] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-04 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042017#comment-15042017
 ] 

Tony Wu commented on HDFS-9491:
---

Hi [~eddyxu],

I just rebased the patch on the latest trunk and ran a few tests. It built 
correctly and the selected tests also ran without issue. Please take a 
look at the v2 patch.

Thanks,
Tony

> Tests should get the number of pending async delets via FsDatasetTestUtils
> --
>
> Key: HDFS-9491
> URL: https://issues.apache.org/jira/browse/HDFS-9491
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: test
> Affects Versions: 2.7.1
> Reporter: Tony Wu
> Assignee: Tony Wu
> Priority: Minor
> Attachments: HDFS-9491.001.patch, HDFS-9491.002.patch
>
>
> A few unit tests use {{DataNodeTestUtils#getPendingAsyncDeletions}} to 
> retrieve the number of pending async deletions. It internally calls 
> {{FsDatasetTestUtil#getPendingAsyncDeletions}}:
> {code:java}
> public static long getPendingAsyncDeletions(FsDatasetSpi fsd) {
>   return ((FsDatasetImpl)fsd).asyncDiskService.countPendingDeletions();
> }
> {code}
> This assumes {{FsDatasetImpl}} is (the only implementation of) {{FsDataset}}. 
> However {{FsDataset}} is pluggable and can have other implementations.
> We can abstract getting the number of async deletions in 
> {{FsDatasetTestUtils}}.





[jira] [Commented] (HDFS-9490) MiniDFSCluster should change block generation stamp via FsDatasetTestUtils

2015-12-02 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036109#comment-15036109
 ] 

Tony Wu commented on HDFS-9490:
---

Hi [~eddyxu], 

Thanks a lot for the quick review. I have updated the patch to address your 
comments. Please take a look at the v2 patch.

Regards,
Tony

> MiniDFSCluster should change block generation stamp via FsDatasetTestUtils
> --
>
> Key: HDFS-9490
> URL: https://issues.apache.org/jira/browse/HDFS-9490
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9490.001.patch, HDFS-9490.002.patch
>
>
> {{MiniDFSCluster#changeGenStampOfBlock}} directly manipulates the block meta 
> file to update the generation stamp. This depends on file based {{FsDataset}}.
> We can abstract the change generation stamp operation in 
> {{FsDatasetTestUtils}}.





[jira] [Updated] (HDFS-9490) MiniDFSCluster should change block generation stamp via FsDatasetTestUtils

2015-12-02 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9490:
--
Attachment: HDFS-9490.002.patch

In v2 patch:
* Addressed [~eddyxu]'s review comments by changing 
{{changeStoredGenerationStamp}} to be {{void}}.
* Updated relevant functions and tests for the change.

> MiniDFSCluster should change block generation stamp via FsDatasetTestUtils
> --
>
> Key: HDFS-9490
> URL: https://issues.apache.org/jira/browse/HDFS-9490
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9490.001.patch, HDFS-9490.002.patch
>
>
> {{MiniDFSCluster#changeGenStampOfBlock}} directly manipulates the block meta 
> file to update the generation stamp. This depends on file based {{FsDataset}}.
> We can abstract the change generation stamp operation in 
> {{FsDatasetTestUtils}}.





[jira] [Commented] (HDFS-9490) MiniDFSCluster should change block generation stamp via FsDatasetTestUtils

2015-12-02 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036754#comment-15036754
 ] 

Tony Wu commented on HDFS-9490:
---

The failed tests are not related to this patch. Only 
{{TestPendingCorruptDnMessages}} and {{TestNameNodeMetadataConsistency}} use 
the updated {{MiniDFSCluster#changeGenStampOfBlock}}. Both of these tests 
pass.

I manually ran the failed tests with JDK 1.8 on OS X and they all pass.

The failed JDK 1.7 tests both suffer from a permission denied error (might be a 
test system issue):
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testListCacheDirectives:
{{Error while running command to get file permissions : ExitCodeException 
exitCode=127: /bin/ls: error while loading shared libraries: libc.so.6: failed 
to map segment from shared object: Permission denied}}
org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testListCacheDirectives:
{{bash: error while loading shared libraries: libdl.so.2: failed to map segment 
from shared object: Permission denied}}

The failed JDK 1.8 test, TestSeveralNameNodes, is tracked by HDFS-9376.


> MiniDFSCluster should change block generation stamp via FsDatasetTestUtils
> --
>
> Key: HDFS-9490
> URL: https://issues.apache.org/jira/browse/HDFS-9490
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9490.001.patch, HDFS-9490.002.patch
>
>
> {{MiniDFSCluster#changeGenStampOfBlock}} directly manipulates the block meta 
> file to update the generation stamp. This depends on file based {{FsDataset}}.
> We can abstract the change generation stamp operation in 
> {{FsDatasetTestUtils}}.





[jira] [Commented] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035142#comment-15035142
 ] 

Tony Wu commented on HDFS-9491:
---

Manually ran the failed tests: TestWriteReadStripedFile, 
TestRenameWithSnapshots & TestDFSStripedOutputStreamWithFailure130 with JDK 1.8 
and 1.7 on OS X. All tests pass without error. None of the failed tests use 
{{getPendingAsyncDeletions}}, so the failures should not be related to the patch.

The ASF License warning is also not related to the patch, as the patch does not 
include any license-related change.

> Tests should get the number of pending async delets via FsDatasetTestUtils
> --
>
> Key: HDFS-9491
> URL: https://issues.apache.org/jira/browse/HDFS-9491
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9491.001.patch
>
>
> A few unit tests use {{DataNodeTestUtils#getPendingAsyncDeletions}} to 
> retrieve the number of pending async deletions. It internally calls 
> {{FsDatasetTestUtil#getPendingAsyncDeletions}}:
> {code:java}
> public static long getPendingAsyncDeletions(FsDatasetSpi fsd) {
>   return ((FsDatasetImpl)fsd).asyncDiskService.countPendingDeletions();
> }
> {code}
> This assumes {{FsDatasetImpl}} is (the only implementation of) {{FsDataset}}. 
> However {{FsDataset}} is pluggable and can have other implementations.
> We can abstract getting the number of async deletions in 
> {{FsDatasetTestUtils}}.





[jira] [Updated] (HDFS-9490) MiniDFSCluster should change block generation stamp via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9490:
--
Attachment: HDFS-9490.001.patch

In v1 patch:
* Add {{FsDatasetTestUtils#changeStoredGenerationStamp}}.
* Update {{MiniDFSCluster#changeGenStampOfBlock}} to use the new API.

> MiniDFSCluster should change block generation stamp via FsDatasetTestUtils
> --
>
> Key: HDFS-9490
> URL: https://issues.apache.org/jira/browse/HDFS-9490
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9490.001.patch
>
>
> {{MiniDFSCluster#changeGenStampOfBlock}} directly manipulates the block meta 
> file to update the generation stamp. This depends on file based {{FsDataset}}.
> We can abstract the change generation stamp operation in 
> {{FsDatasetTestUtils}}.





[jira] [Created] (HDFS-9490) MiniDFSCluster should change block generation stamp via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9490:
-

 Summary: MiniDFSCluster should change block generation stamp via 
FsDatasetTestUtils
 Key: HDFS-9490
 URL: https://issues.apache.org/jira/browse/HDFS-9490
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Minor


{{MiniDFSCluster#changeGenStampOfBlock}} directly manipulates the block meta 
file to update the generation stamp. This depends on file based {{FsDataset}}.

We can abstract the change generation stamp operation in {{FsDatasetTestUtils}}.
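
To illustrate the idea, here is a minimal sketch, not actual Hadoop code. Every 
type whose name ends in Sketch is a hypothetical stand-in, and the 
{{(long blockId, long newGenStamp)}} parameters are assumptions made for the 
example. The point is only that the cluster helper delegates to a per-dataset 
test utility instead of editing block meta files itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for FsDatasetTestUtils; parameter types are assumptions.
interface GenStampTestUtilsSketch {
  void changeStoredGenerationStamp(long blockId, long newGenStamp);
}

// Hypothetical dataset that keeps generation stamps in memory, not in meta files.
final class InMemoryGenStampUtilsSketch implements GenStampTestUtilsSketch {
  private final Map<Long, Long> genStamps = new HashMap<>();

  void addBlock(long blockId, long genStamp) {
    genStamps.put(blockId, genStamp);
  }

  long getGenerationStamp(long blockId) {
    Long genStamp = genStamps.get(blockId);
    if (genStamp == null) {
      throw new IllegalArgumentException("Unknown block " + blockId);
    }
    return genStamp;
  }

  @Override
  public void changeStoredGenerationStamp(long blockId, long newGenStamp) {
    if (!genStamps.containsKey(blockId)) {
      throw new IllegalArgumentException("Unknown block " + blockId);
    }
    genStamps.put(blockId, newGenStamp);
  }
}

// Hypothetical stand-in for MiniDFSCluster#changeGenStampOfBlock: it only
// delegates, so it works with any dataset that supplies a test utility.
final class MiniClusterSketch {
  private final GenStampTestUtilsSketch utils;

  MiniClusterSketch(GenStampTestUtilsSketch utils) {
    this.utils = utils;
  }

  void changeGenStampOfBlock(long blockId, long newGenStamp) {
    utils.changeStoredGenerationStamp(blockId, newGenStamp);
  }
}
```

With this shape, a file-based dataset would supply a utility that rewrites the 
meta file, while other datasets supply their own, and the cluster helper stays 
the same.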





[jira] [Updated] (HDFS-9490) MiniDFSCluster should change block generation stamp via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9490:
--
Status: Patch Available  (was: Open)

> MiniDFSCluster should change block generation stamp via FsDatasetTestUtils
> --
>
> Key: HDFS-9490
> URL: https://issues.apache.org/jira/browse/HDFS-9490
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9490.001.patch
>
>
> {{MiniDFSCluster#changeGenStampOfBlock}} directly manipulates the block meta 
> file to update the generation stamp. This depends on file based {{FsDataset}}.
> We can abstract the change generation stamp operation in 
> {{FsDatasetTestUtils}}.





[jira] [Created] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9491:
-

 Summary: Tests should get the number of pending async delets via 
FsDatasetTestUtils
 Key: HDFS-9491
 URL: https://issues.apache.org/jira/browse/HDFS-9491
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: test
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Minor


A few unit tests use {{DataNodeTestUtils#getPendingAsyncDeletions}} to retrieve 
the number of pending async deletions. It internally calls 
{{FsDatasetTestUtil#getPendingAsyncDeletions}}:
{code:java}
public static long getPendingAsyncDeletions(FsDatasetSpi fsd) {
  return ((FsDatasetImpl)fsd).asyncDiskService.countPendingDeletions();
}
{code}
This assumes {{FsDatasetImpl}} is (the only implementation of) {{FsDataset}}. 
However {{FsDataset}} is pluggable and can have other implementations.

We can abstract getting the number of async deletions in {{FsDatasetTestUtils}}.
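
A minimal sketch of what this abstraction could look like, not actual Hadoop 
code. All types whose names end in Sketch are hypothetical stand-ins; only the 
method names {{getPendingAsyncDeletions}} and {{countPendingDeletions}} come 
from the description above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for FsDatasetTestUtils.
interface PendingDeletionUtilsSketch {
  long getPendingAsyncDeletions();
}

// Hypothetical stand-in for the default dataset and its async disk service.
final class DefaultDatasetSketch {
  private final AtomicLong pendingDeletions = new AtomicLong();

  void scheduleDeletion() {
    pendingDeletions.incrementAndGet();
  }

  void completeDeletion() {
    pendingDeletions.decrementAndGet();
  }

  long countPendingDeletions() {
    return pendingDeletions.get();
  }
}

// Utility bound to one implementation; tests never cast the dataset themselves.
final class DefaultDatasetUtilsSketch implements PendingDeletionUtilsSketch {
  private final DefaultDatasetSketch dataset;

  DefaultDatasetUtilsSketch(DefaultDatasetSketch dataset) {
    this.dataset = dataset;
  }

  @Override
  public long getPendingAsyncDeletions() {
    return dataset.countPendingDeletions();
  }
}

// A second hypothetical implementation that deletes synchronously,
// so it never has anything pending.
final class SyncDatasetUtilsSketch implements PendingDeletionUtilsSketch {
  @Override
  public long getPendingAsyncDeletions() {
    return 0;
  }
}
```

A test written against {{PendingDeletionUtilsSketch}} runs unchanged with either 
implementation, which is the property the cast to {{FsDatasetImpl}} prevents.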





[jira] [Updated] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9491:
--
Target Version/s: 3.0.0, 2.8.0

> Tests should get the number of pending async delets via FsDatasetTestUtils
> --
>
> Key: HDFS-9491
> URL: https://issues.apache.org/jira/browse/HDFS-9491
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
>
> A few unit tests use {{DataNodeTestUtils#getPendingAsyncDeletions}} to 
> retrieve the number of pending async deletions. It internally calls 
> {{FsDatasetTestUtil#getPendingAsyncDeletions}}:
> {code:java}
> public static long getPendingAsyncDeletions(FsDatasetSpi fsd) {
>   return ((FsDatasetImpl)fsd).asyncDiskService.countPendingDeletions();
> }
> {code}
> This assumes {{FsDatasetImpl}} is (the only implementation of) {{FsDataset}}. 
> However {{FsDataset}} is pluggable and can have other implementations.
> We can abstract getting the number of async deletions in 
> {{FsDatasetTestUtils}}.





[jira] [Updated] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9491:
--
Attachment: HDFS-9491.001.patch

In v1 patch:
* Add {{FsDatasetTestUtils#getPendingAsyncDeletions}}.
* Update relevant tests to use the new API.
* Remove old APIs.

> Tests should get the number of pending async delets via FsDatasetTestUtils
> --
>
> Key: HDFS-9491
> URL: https://issues.apache.org/jira/browse/HDFS-9491
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9491.001.patch
>
>
> A few unit tests use {{DataNodeTestUtils#getPendingAsyncDeletions}} to 
> retrieve the number of pending async deletions. It internally calls 
> {{FsDatasetTestUtil#getPendingAsyncDeletions}}:
> {code:java}
> public static long getPendingAsyncDeletions(FsDatasetSpi fsd) {
>   return ((FsDatasetImpl)fsd).asyncDiskService.countPendingDeletions();
> }
> {code}
> This assumes {{FsDatasetImpl}} is (the only implementation of) {{FsDataset}}. 
> However {{FsDataset}} is pluggable and can have other implementations.
> We can abstract getting the number of async deletions in 
> {{FsDatasetTestUtils}}.





[jira] [Updated] (HDFS-9491) Tests should get the number of pending async delets via FsDatasetTestUtils

2015-12-01 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9491:
--
Status: Patch Available  (was: Open)

> Tests should get the number of pending async delets via FsDatasetTestUtils
> --
>
> Key: HDFS-9491
> URL: https://issues.apache.org/jira/browse/HDFS-9491
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9491.001.patch
>
>
> A few unit tests use {{DataNodeTestUtils#getPendingAsyncDeletions}} to 
> retrieve the number of pending async deletions. It internally calls 
> {{FsDatasetTestUtil#getPendingAsyncDeletions}}:
> {code:java}
> public static long getPendingAsyncDeletions(FsDatasetSpi fsd) {
>   return ((FsDatasetImpl)fsd).asyncDiskService.countPendingDeletions();
> }
> {code}
> This assumes {{FsDatasetImpl}} is (the only implementation of) {{FsDataset}}. 
> However {{FsDataset}} is pluggable and can have other implementations.
> We can abstract getting the number of async deletions in 
> {{FsDatasetTestUtils}}.





[jira] [Commented] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-11-05 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991879#comment-14991879
 ] 

Tony Wu commented on HDFS-9282:
---

The failed test {{TestDeleteRace}} is not relevant to the change. I also manually 
verified that the test passes on a Linux machine.

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch, HDFS-9282.002.patch, 
> HDFS-9282.003.patch, HDFS-9282.004.patch
>
>
> MiniDFSCluster and several tests hard-code the assumption that the underlying 
> storage has 2 data directories (volumes). As HDFS-9188 pointed out, with 
> new FsDataset implementations, these hard-coded assumptions about the number of 
> data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-11-05 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992724#comment-14992724
 ] 

Tony Wu commented on HDFS-9236:
---

Looked at the failed tests and none are related to block recovery. I also 
manually ran the failed tests against the latest code (on Linux, JDK 1.7); all 
pass without error.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch, HDFS-9236.004.patch, HDFS-9236.005.patch, 
> HDFS-9236.006.patch, HDFS-9236.007.patch
>
>
> Ran into an issue while running tests against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>                  List<BlockRecord> syncList) throws IOException {
>     …
>     // Calculate the best available replica state.
>     ReplicaState bestState = ReplicaState.RWR;
>     …
>     // Calculate list of nodes that will participate in the recovery
>     // and the new block size
>     List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
>     final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
>         -1, recoveryId);
>     switch(bestState) {
>     …
>     case RBW:
>     case RWR:
>       long minLength = Long.MAX_VALUE;
>       for(BlockRecord r : syncList) {
>         ReplicaState rState = r.rInfo.getOriginalReplicaState();
>         if(rState == bestState) {
>           minLength = Math.min(minLength, r.rInfo.getNumBytes());
>           participatingList.add(r);
>         }
>       }
>       newBlock.setNumBytes(minLength);
>       break;
>     …
>     }
>     …
>     nn.commitBlockSynchronization(block,
>         newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
>         datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState values (reported by DNs with 
> copies of the replica being recovered) to match bestState. This can be caused 
> either by faulty DN code or by stale/modified/corrupted files on the DN. When 
> this happens, the DN ends up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>       long newgenerationstamp, long newlength,
>       boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>       String[] newtargetstorages) throws IOException {
>     …
>     if (deleteblock) {
>       Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
>       boolean remove = iFile.removeLastBlock(blockToDel) != null;
>       if (remove) {
>         blockManager.removeBlock(storedBlock);
>       }
>     } else {
>       // update last block
>       if(!copyTruncate) {
>         storedBlock.setGenerationStamp(newgenerationstamp);
>
>         // XXX block length is updated without any check <<<
>         storedBlock.setNumBytes(newlength);
>       }
>       …
>     if (closeFile) {
>       LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>           + ", file=" + src
>           + (copyTruncate ? ", newBlock=" + truncatedBlock
>               : ", newgenerationstamp=" + newgenerationstamp)
>           + ", newlength=" + newlength
>           + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
>     } else {
>       LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
>     }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. This block could be the last block of the file. If this 
> happens and the client goes away, NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.
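
For reference, a minimal sketch of the kind of sanity check the description asks 
for, not the attached patch. {{RecoveryLengthSketch}} and both of its methods 
are hypothetical, and unchecked exceptions are used only to keep the example 
self-contained. The upper bound passed to the NN-side check is an assumption; 
the issue does not say which limit the NN should apply.

```java
import java.util.List;

// Hypothetical helper; neither method exists in Hadoop under this name.
final class RecoveryLengthSketch {

  // DN side: same minimum as the minLength loop in syncBlock, but the
  // Long.MAX_VALUE sentinel is never returned to the caller.
  static long minParticipatingLength(List<Long> participatingLengths) {
    long minLength = Long.MAX_VALUE;
    for (long length : participatingLengths) {
      minLength = Math.min(minLength, length);
    }
    if (minLength == Long.MAX_VALUE) {
      throw new IllegalStateException(
          "No replica matched the best state; refusing to report a length");
    }
    return minLength;
  }

  // NN side: reject lengths that cannot be valid before storing them.
  // maxBlockSize is an assumed limit supplied by the caller.
  static void checkNewLength(long newLength, long maxBlockSize) {
    if (newLength < 0 || newLength > maxBlockSize) {
      throw new IllegalArgumentException(
          "Invalid block length " + newLength + " (max " + maxBlockSize + ")");
    }
  }
}
```

With an empty participating list the DN-side method throws instead of 
returning Long.MAX_VALUE, and the NN-side check would reject that value even if 
a faulty DN still sent it.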





[jira] [Commented] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-11-04 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990521#comment-14990521
 ] 

Tony Wu commented on HDFS-9282:
---

Thanks [~eddyxu] for catching this. I have incorporated the comments in the 
latest patch.

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch, HDFS-9282.002.patch, 
> HDFS-9282.003.patch
>
>
> MiniDFSCluster and several tests hard-code the assumption that the underlying 
> storage has 2 data directories (volumes). As HDFS-9188 pointed out, with 
> new FsDataset implementations, these hard-coded assumptions about the number of 
> data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Updated] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-11-04 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9282:
--
Attachment: HDFS-9282.004.patch

In v4 patch:
* Addressed [~eddyxu]'s review comments (use {{try-with-resources}} to avoid 
leaking resources).
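
For readers unfamiliar with the pattern, a minimal sketch of 
{{try-with-resources}} follows. {{TrackedResourceSketch}} and 
{{TryWithResourcesSketch}} are hypothetical and are not classes from the patch.

```java
// Hypothetical resource used only for illustration; not a class from the patch.
final class TrackedResourceSketch implements AutoCloseable {
  private boolean closed;

  void use(boolean fail) {
    if (fail) {
      throw new IllegalStateException("simulated failure");
    }
  }

  boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    closed = true;
  }
}

final class TryWithResourcesSketch {
  // Uses the resource and returns it so the caller can check that it was closed.
  static TrackedResourceSketch useAndClose(boolean fail) {
    TrackedResourceSketch resource = new TrackedResourceSketch();
    try (TrackedResourceSketch r = resource) {
      r.use(fail);
    } catch (IllegalStateException e) {
      // close() has already run by the time this catch block executes.
    }
    return resource;
  }
}
```

The resource is closed on both the normal and the exceptional path without an 
explicit finally block, which is what avoids the leak.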

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch, HDFS-9282.002.patch, 
> HDFS-9282.003.patch, HDFS-9282.004.patch
>
>
> MiniDFSCluster and several tests hard-code the assumption that the underlying 
> storage has 2 data directories (volumes). As HDFS-9188 pointed out, with 
> new FsDataset implementations, these hard-coded assumptions about the number of 
> data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-11-04 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990215#comment-14990215
 ] 

Tony Wu commented on HDFS-9236:
---

Thanks a lot [~yzhangal] for your comments. I incorporated them into the new 
patch.
I added the debug logs but kept the positive logic for determining which 
replica info to add to syncList in existing code/patch. IMO the positive logic 
is easier to read/understand.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch, HDFS-9236.004.patch, HDFS-9236.005.patch, 
> HDFS-9236.006.patch
>
>
> Ran into an issue while running tests against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>                  List<BlockRecord> syncList) throws IOException {
>     …
>     // Calculate the best available replica state.
>     ReplicaState bestState = ReplicaState.RWR;
>     …
>     // Calculate list of nodes that will participate in the recovery
>     // and the new block size
>     List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
>     final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
>         -1, recoveryId);
>     switch(bestState) {
>     …
>     case RBW:
>     case RWR:
>       long minLength = Long.MAX_VALUE;
>       for(BlockRecord r : syncList) {
>         ReplicaState rState = r.rInfo.getOriginalReplicaState();
>         if(rState == bestState) {
>           minLength = Math.min(minLength, r.rInfo.getNumBytes());
>           participatingList.add(r);
>         }
>       }
>       newBlock.setNumBytes(minLength);
>       break;
>     …
>     }
>     …
>     nn.commitBlockSynchronization(block,
>         newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
>         datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState values (reported by DNs with 
> copies of the replica being recovered) to match bestState. This can be caused 
> either by faulty DN code or by stale/modified/corrupted files on the DN. When 
> this happens, the DN ends up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>       long newgenerationstamp, long newlength,
>       boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>       String[] newtargetstorages) throws IOException {
>     …
>     if (deleteblock) {
>       Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
>       boolean remove = iFile.removeLastBlock(blockToDel) != null;
>       if (remove) {
>         blockManager.removeBlock(storedBlock);
>       }
>     } else {
>       // update last block
>       if(!copyTruncate) {
>         storedBlock.setGenerationStamp(newgenerationstamp);
>
>         // XXX block length is updated without any check <<<
>         storedBlock.setNumBytes(newlength);
>       }
>       …
>     if (closeFile) {
>       LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>           + ", file=" + src
>           + (copyTruncate ? ", newBlock=" + truncatedBlock
>               : ", newgenerationstamp=" + newgenerationstamp)
>           + ", newlength=" + newlength
>           + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
>     } else {
>       LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
>     }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. This block could be the last block of the file. If this 
> happens and the client goes away, NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-11-04 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Attachment: HDFS-9236.007.patch

In v7 patch:
* Addressed [~yzhangal]'s review comments.
* Updated the test case.
* Added a {{toString()}} method to pretty print {{ReplicaRecoveryInfo}}.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch, HDFS-9236.004.patch, HDFS-9236.005.patch, 
> HDFS-9236.006.patch, HDFS-9236.007.patch
>
>
> Ran into an issue while running tests against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>                  List<BlockRecord> syncList) throws IOException {
>     …
>     // Calculate the best available replica state.
>     ReplicaState bestState = ReplicaState.RWR;
>     …
>     // Calculate list of nodes that will participate in the recovery
>     // and the new block size
>     List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
>     final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
>         -1, recoveryId);
>     switch(bestState) {
>     …
>     case RBW:
>     case RWR:
>       long minLength = Long.MAX_VALUE;
>       for(BlockRecord r : syncList) {
>         ReplicaState rState = r.rInfo.getOriginalReplicaState();
>         if(rState == bestState) {
>           minLength = Math.min(minLength, r.rInfo.getNumBytes());
>           participatingList.add(r);
>         }
>       }
>       newBlock.setNumBytes(minLength);
>       break;
>     …
>     }
>     …
>     nn.commitBlockSynchronization(block,
>         newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
>         datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState values (reported by DNs with 
> copies of the replica being recovered) to match bestState. This can be caused 
> either by faulty DN code or by stale/modified/corrupted files on the DN. When 
> this happens, the DN ends up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>       long newgenerationstamp, long newlength,
>       boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>       String[] newtargetstorages) throws IOException {
>     …
>     if (deleteblock) {
>       Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
>       boolean remove = iFile.removeLastBlock(blockToDel) != null;
>       if (remove) {
>         blockManager.removeBlock(storedBlock);
>       }
>     } else {
>       // update last block
>       if(!copyTruncate) {
>         storedBlock.setGenerationStamp(newgenerationstamp);
>
>         // XXX block length is updated without any check <<<
>         storedBlock.setNumBytes(newlength);
>       }
>       …
>     if (closeFile) {
>       LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>           + ", file=" + src
>           + (copyTruncate ? ", newBlock=" + truncatedBlock
>               : ", newgenerationstamp=" + newgenerationstamp)
>           + ", newlength=" + newlength
>           + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
>     } else {
>       LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
>     }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. This block could be the last block of the file. If this 
> happens and the client goes away, NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-11-03 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Attachment: HDFS-9236.006.patch

In v6 patch:
* Addressed [~walter.k.su]'s comment by excluding RUR replicas from syncList. 
{{syncBlock()}} now works on a clean syncList containing only replicas that 
will be used for recovery.
* Converted the check in previous patches for {{Long.MAX_VALUE}} to an assert.
* Reworked the test case.
* Added some comments.
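
To illustrate the first point, a minimal sketch of the filtering idea, not the 
code in the patch. {{ReplicaStateSketch}}, {{ReplicaInfoSketch}} and 
{{SyncListFilterSketch}} are hypothetical stand-ins for the real types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ReplicaState.
enum ReplicaStateSketch { FINALIZED, RBW, RWR, RUR }

// Hypothetical stand-in for the replica info reported by each DN.
final class ReplicaInfoSketch {
  final ReplicaStateSketch originalState;
  final long numBytes;

  ReplicaInfoSketch(ReplicaStateSketch originalState, long numBytes) {
    this.originalState = originalState;
    this.numBytes = numBytes;
  }
}

final class SyncListFilterSketch {
  // Keeps only replicas that can take part in recovery: RUR replicas are
  // dropped before the list is handed to the synchronization step.
  static List<ReplicaInfoSketch> buildSyncList(List<ReplicaInfoSketch> reported) {
    List<ReplicaInfoSketch> syncList = new ArrayList<>();
    for (ReplicaInfoSketch replica : reported) {
      if (replica.originalState != ReplicaStateSketch.RUR) {
        syncList.add(replica);
      }
    }
    return syncList;
  }
}
```

Because unusable replicas never reach the synchronization step, a 
{{Long.MAX_VALUE}} length there indicates a programming error rather than bad 
input, which is why an assert is enough at that point.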

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch, HDFS-9236.004.patch, HDFS-9236.005.patch, 
> HDFS-9236.006.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState values (reported by DNs with 
> copies of the replica being recovered) to match the bestState. This can be 
> caused either by faulty DN code or by stale/modified/corrupted files on the 
> DN. When this happens the DN will end up reporting a minLength of 
> Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with the correct length) will cause the block to be marked 
> as corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease 
> and close the file because the last block is under-replicated.
> I believe we need a sanity check for block size on both the DN and the NN to 
> prevent such a case from happening.
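
The DN-side guard asked for in the description could look roughly like the following. This is a minimal sketch, not the attached patch: {{ReplicaLengthCheck}}, {{Replica}} and {{minLengthOf}} are hypothetical names, and the state is a plain string here, whereas the real {{syncBlock()}} works on {{BlockRecord}} and {{ReplicaState}} objects.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReplicaLengthCheck {
  /** Hypothetical stand-in for a replica reported during block recovery. */
  static final class Replica {
    final String state;
    final long numBytes;

    Replica(String state, long numBytes) {
      this.state = state;
      this.numBytes = numBytes;
    }
  }

  /**
   * Returns the shortest length among the replicas that are in bestState.
   * Throws instead of returning Long.MAX_VALUE when no replica matches.
   */
  static long minLengthOf(List<Replica> syncList, String bestState)
      throws IOException {
    long minLength = Long.MAX_VALUE;
    int participating = 0;
    for (Replica r : syncList) {
      if (r.state.equals(bestState)) {
        minLength = Math.min(minLength, r.numBytes);
        participating++;
      }
    }
    if (participating == 0) {
      // Nothing matched, so minLength is still the initial sentinel value.
      throw new IOException("No replica in state " + bestState
          + "; refusing to report block length " + minLength);
    }
    return minLength;
  }

  public static void main(String[] args) {
    List<Replica> syncList = new ArrayList<>();
    syncList.add(new Replica("RBW", 1024L));
    syncList.add(new Replica("RBW", 512L));
    try {
      System.out.println("min length: " + minLengthOf(syncList, "RBW"));
      minLengthOf(syncList, "RWR"); // no replica is in this state
    } catch (IOException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Counting the participating replicas, rather than comparing the result against {{Long.MAX_VALUE}}, keeps the guard independent of the sentinel chosen for the initial value.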





[jira] [Created] (HDFS-9363) Add fetchReplica() to FsDatasetTestUtils()

2015-11-02 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9363:
-

 Summary: Add fetchReplica() to FsDatasetTestUtils()
 Key: HDFS-9363
 URL: https://issues.apache.org/jira/browse/HDFS-9363
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Minor


{{FsDatasetTestUtils()}} abstracts away the details in {{FsDataset}} to allow 
writing generic tests regardless of underlying {{FsDataset}} implementations. 
We can add a {{fetchReplica()}} method to allow some HDFS tests to avoid using 
{{FsDatasetTestUtil#fetchReplicaInfo()}}, which assumes FsDatasetImpl is the 
only implementation of FsDataset.





[jira] [Updated] (HDFS-9363) Add fetchReplica() to FsDatasetTestUtils()

2015-11-02 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9363:
--
Attachment: HDFS-9363.001.patch

In v1 patch:
* Add {{fetchReplica()}} to {{FsDatasetTestUtils()}}.
* Modify {{TestInterDatanodeProtocol}} and {{TestPipelines}} to use the new API.



[jira] [Updated] (HDFS-9363) Add fetchReplica() to FsDatasetTestUtils()

2015-11-02 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9363:
--
Status: Patch Available  (was: Open)



[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-31 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984040#comment-14984040
 ] 

Tony Wu commented on HDFS-9236:
---

Thanks [~walter.k.su] and [~yzhangal] for your comments. I'll post a new patch 
which will exclude RURs from syncList.



[jira] [Updated] (HDFS-9308) Add truncateMeta() and deleteMeta() to MiniDFSCluster

2015-10-30 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9308:
--
Attachment: HDFS-9308.003.patch

In v3 patch:
* Addressed [~eddyxu]'s review comments.

*Please note:*
{{TestCrcCorruption#testCrcCorruption}} seems to have been broken by commit 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f for HDFS-4937 (I did a quick binary 
search to find this change). I will open a separate JIRA for the test failure.

I manually applied the patch to the tree before that change and verified that it works:
{code}
---
 T E S T S
---
Running org.apache.hadoop.hdfs.TestCrcCorruption
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.56 sec - in 
org.apache.hadoop.hdfs.TestCrcCorruption
Running org.apache.hadoop.hdfs.TestLeaseRecovery
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.752 sec - in 
org.apache.hadoop.hdfs.TestLeaseRecovery

Results :

Tests run: 7, Failures: 0, Errors: 0, Skipped: 0
{code}

> Add truncateMeta() and deleteMeta() to MiniDFSCluster
> -
>
> Key: HDFS-9308
> URL: https://issues.apache.org/jira/browse/HDFS-9308
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9308.001.patch, HDFS-9308.002.patch, 
> HDFS-9308.003.patch
>
>
> HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the 
> metadata file filesystem-agnostic. There should also be {{truncateMeta()}} 
> and {{deleteMeta()}} methods in MiniDFSCluster to allow truncation of 
> metadata files on DataNodes without writing code that's specific to the 
> underlying file system. {{FsDatasetTestUtils#truncateMeta()}} is already 
> implemented by HDFS-9188 and can be exposed easily in {{MiniDFSCluster}}.
> This will be useful for tests such as 
> {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} and 
> {{TestCrcCorruption#testCrcCorruption}}.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-30 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982958#comment-14982958
 ] 

Tony Wu commented on HDFS-9236:
---

Thanks a lot to [~walter.k.su] and [~zhz] for your comments!

[~walter.k.su], DN throws {{RecoveryInProgressException}} only when the 
received recovery ID is not larger than the existing RUR recovery ID:
{code:java}
  static ReplicaRecoveryInfo initReplicaRecovery(String bpid, ReplicaMap map,
      Block block, long recoveryId, long xceiverStopTimeout) throws IOException {
    ...
    final ReplicaUnderRecovery rur;
    if (replica.getState() == ReplicaState.RUR) {
      rur = (ReplicaUnderRecovery)replica;
      if (rur.getRecoveryID() >= recoveryId) {
        throw new RecoveryInProgressException(
            "rur.getRecoveryID() >= recoveryId = " + recoveryId
            + ", block=" + block + ", rur=" + rur);
      }
      final long oldRecoveryID = rur.getRecoveryID();
      rur.setRecoveryID(recoveryId);
      LOG.info("initReplicaRecovery: update recovery id for " + block
          + " from " + oldRecoveryID + " to " + recoveryId);
    }
  }
{code}

So if the DN has a block that is already in RUR, and a new block recovery 
starts (with larger recovery ID), the DN does not throw 
{{RecoveryInProgressException}}.
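
The comparison can be isolated in a small standalone sketch. {{RecoveryIdCheck}}, {{acceptRecoveryId}} and the {{RecoveryInProgress}} class are hypothetical stand-ins rather than Hadoop APIs; only the {{>=}} comparison is taken from the snippet above.

```java
public class RecoveryIdCheck {
  /** Hypothetical stand-in for RecoveryInProgressException. */
  static class RecoveryInProgress extends Exception {
    RecoveryInProgress(String message) {
      super(message);
    }
  }

  /**
   * Accepts a newly received recovery ID only if it is strictly larger than
   * the ID already recorded on the replica, and returns the accepted ID.
   */
  static long acceptRecoveryId(long currentId, long receivedId)
      throws RecoveryInProgress {
    if (currentId >= receivedId) {
      throw new RecoveryInProgress(
          "current recovery ID " + currentId + " >= received " + receivedId);
    }
    return receivedId;
  }

  public static void main(String[] args) {
    try {
      // A larger ID replaces the recorded one.
      System.out.println("accepted: " + acceptRecoveryId(5L, 7L));
      // An equal (or smaller) ID is rejected.
      acceptRecoveryId(7L, 7L);
    } catch (RecoveryInProgress e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```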

The patch is focused on what happens after this point, where a buggy DN (or an 
unknown corner case) might cause the DN to report RUR as the replica's original state.

I think your suggestion of moving the check out of {{syncBlock()}} and into 
{{initReplicaRecovery()}} makes sense. I implemented a check to simply exclude 
the replicas whose original state is >= RUR (they won't be used for recovery 
anyway). But the issue with this is that we might end up with an empty 
{{syncList}} and incorrectly tell NN to drop this block. I think the current 
place for the check in the patch is probably the safest. Please let me know 
what you think.
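
The empty-list concern can be illustrated with a small sketch. Everything here is hypothetical: {{SyncListFilter}}, {{usable}} and {{emptiedByFilter}} are invented names, and the {{State}} enum only assumes that the declaration order matches the real {{ReplicaState}} so that ">= RUR" can be expressed with {{compareTo}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SyncListFilter {
  /** Hypothetical enum; the declaration order is an assumption. */
  enum State { FINALIZED, RBW, RWR, RUR, TEMPORARY }

  /** Keeps only replicas whose original state is below RUR. */
  static List<State> usable(List<State> reported) {
    List<State> kept = new ArrayList<>();
    for (State s : reported) {
      if (s.compareTo(State.RUR) < 0) {
        kept.add(s);
      }
    }
    return kept;
  }

  /** True when replicas were reported but every one was filtered out. */
  static boolean emptiedByFilter(List<State> reported) {
    return !reported.isEmpty() && usable(reported).isEmpty();
  }

  public static void main(String[] args) {
    List<State> mixed = Arrays.asList(State.RBW, State.RUR, State.RWR);
    List<State> onlyUnusable = Arrays.asList(State.RUR, State.TEMPORARY);
    System.out.println("usable replicas: " + usable(mixed));
    System.out.println("emptied by filter: " + emptiedByFilter(onlyUnusable));
  }
}
```

A caller that only sees the filtered list cannot tell "no replicas were reported" apart from "all reported replicas were excluded", which is why the second case is tracked separately in this sketch.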

Again thanks a lot for taking the time to look at my patch.




[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-29 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980912#comment-14980912
 ] 

Tony Wu commented on HDFS-9236:
---

[~yzhangal] Thanks a lot for looking at the patch.



[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-29 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981071#comment-14981071
 ] 

Tony Wu commented on HDFS-9236:
---

Hi [~yzhangal], I believe HDFS-9255 has moved block recovery related code to a 
different location. I will rebase my patch and upload a new one shortly.



[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-29 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Attachment: HDFS-9236.004.patch

In v4 patch:
* Rebased to latest trunk (moved the changes to new file 
{{BlockRecoveryWorker.java}}).



[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-29 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981419#comment-14981419
 ] 

Tony Wu commented on HDFS-9236:
---

Hi [~liuml07], 

Thanks a lot for your comment. I debated using an assert as well and 
think it has a few disadvantages (please correct me if I'm wrong):

# Asserts can be disabled at runtime.
# An assert message is visible only on the DN, whereas an exception can 
propagate back to the NN (and is also visible on the DN).
# An assert would stop the DN process, which seems like overkill.

Given these reasons I think throwing an exception is the better choice.

Tony 
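
The first point can be checked with a short sketch. {{AssertVersusException}} and its methods are hypothetical names used only for illustration; the behaviour shown is that of the standard Java {{assert}} statement, which is skipped unless the JVM is started with {{-ea}}.

```java
import java.io.IOException;

public class AssertVersusException {
  /** Reports whether the JVM was started with assertions enabled (-ea). */
  static boolean assertionsEnabled() {
    boolean enabled = false;
    // The assignment only runs when assertions are enabled.
    assert enabled = true;
    return enabled;
  }

  /** Always rejects the sentinel value, regardless of JVM flags. */
  static void checkWithException(long length) throws IOException {
    if (length == Long.MAX_VALUE) {
      throw new IOException("Invalid block length: " + length);
    }
  }

  /** Rejects the sentinel value only when assertions are enabled. */
  static void checkWithAssert(long length) {
    assert length != Long.MAX_VALUE : "Invalid block length: " + length;
  }

  public static void main(String[] args) {
    System.out.println("assertions enabled: " + assertionsEnabled());
    try {
      checkWithAssert(Long.MAX_VALUE);
      System.out.println("assert check let the bad length through");
    } catch (AssertionError e) {
      System.out.println("assert check failed: " + e.getMessage());
    }
    try {
      checkWithException(Long.MAX_VALUE);
    } catch (IOException e) {
      System.out.println("exception check failed: " + e.getMessage());
    }
  }
}
```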

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch, HDFS-9236.004.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List participatingList = new ArrayList();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting the minLengh of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>       long newgenerationstamp, long newlength,
>       boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>       String[] newtargetstorages) throws IOException {
>     …
>     if (deleteblock) {
>       Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
>       boolean remove = iFile.removeLastBlock(blockToDel) != null;
>       if (remove) {
>         blockManager.removeBlock(storedBlock);
>       }
>     } else {
>       // update last block
>       if(!copyTruncate) {
>         storedBlock.setGenerationStamp(newgenerationstamp);
>
>         // XXX block length is updated without any check <<<
>         storedBlock.setNumBytes(newlength);
>       }
>       …
>     if (closeFile) {
>       LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>           + ", file=" + src
>           + (copyTruncate ? ", newBlock=" + truncatedBlock
>               : ", newgenerationstamp=" + newgenerationstamp)
>           + ", newlength=" + newlength
>           + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
>     } else {
>       LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
>     }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent
> block report (even with the correct length) will cause the block to be marked
> as corrupted. This block could be the last block of the file; if that is the
> case and the client goes away, the NN won’t be able to recover the lease and
> close the file because the last block is under-replicated.
> I believe we need a sanity check for block size on both the DN and the NN to
> prevent this case from happening.
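
The check proposed above could look roughly like the following. This is a minimal sketch and not the attached patch: {{BlockLengthSanityCheck}}, {{Replica}}, {{minParticipatingLength()}} and {{checkNewLength()}} are hypothetical stand-ins for the HDFS types and call sites quoted in the description, and the rejection policy (refuse Long.MAX_VALUE and negative lengths) is an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for the HDFS types quoted above; not the real classes. */
final class BlockLengthSanityCheck {
  enum ReplicaState { FINALIZED, RBW, RWR }

  /** Minimal replica record: original state plus reported length in bytes. */
  static final class Replica {
    final ReplicaState state;
    final long numBytes;

    Replica(ReplicaState state, long numBytes) {
      this.state = state;
      this.numBytes = numBytes;
    }
  }

  /**
   * DN side: returns the minimum length among replicas whose state equals
   * bestState, and throws instead of returning the Long.MAX_VALUE sentinel
   * when no replica matches.
   */
  static long minParticipatingLength(List<Replica> syncList,
      ReplicaState bestState) throws IOException {
    long minLength = Long.MAX_VALUE;
    List<Replica> participating = new ArrayList<>();
    for (Replica r : syncList) {
      if (r.state == bestState) {
        minLength = Math.min(minLength, r.numBytes);
        participating.add(r);
      }
    }
    if (participating.isEmpty() || minLength == Long.MAX_VALUE) {
      throw new IOException("No usable replica in state " + bestState
          + "; refusing to report block length " + minLength);
    }
    return minLength;
  }

  /** NN side (assumed policy): reject negative or sentinel lengths. */
  static void checkNewLength(long newLength) throws IOException {
    if (newLength < 0 || newLength == Long.MAX_VALUE) {
      throw new IOException("Invalid block length: " + newLength);
    }
  }
}
```

With two RWR replicas of 100 and 50 bytes, {{minParticipatingLength()}} returns 50 for RWR and throws for RBW, so the sentinel never reaches the NameNode; {{checkNewLength()}} is the second line of defence on the NN side.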



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-29 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9282:
--
Attachment: HDFS-9282.003.patch

In v3 patch:
* Rebased patch on latest trunk.

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch, HDFS-9282.002.patch, 
> HDFS-9282.003.patch
>
>
> MiniDFSCluster and several tests have a hard-coded assumption that the
> underlying storage has 2 data directories (volumes). As HDFS-9188 pointed out,
> with new FsDataset implementations, these hard-coded assumptions about the
> number of data directories and raw capacities of storage may change as well.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.
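
One way to picture the extension listed above is a small contract that tests query instead of hard-coding "2 volumes". This is only a sketch under assumptions: {{DatasetLayout}}, {{TwoVolumeLayout}} and {{ExpectedClusterValues}} are illustrative names, not the FsDatasetTestUtils API.

```java
/** Hypothetical test-utility contract; names are illustrative, not the HDFS API. */
interface DatasetLayout {
  /** Number of data directories (volumes) each DataNode uses. */
  int numDataDirsPerDataNode();

  /** Raw capacity in bytes of the storage backing one DataNode. */
  long rawCapacityPerDataNode();
}

/** Layout matching the current assumption of 2 volumes per DataNode. */
final class TwoVolumeLayout implements DatasetLayout {
  private final long bytesPerVolume;

  TwoVolumeLayout(long bytesPerVolume) {
    this.bytesPerVolume = bytesPerVolume;
  }

  @Override
  public int numDataDirsPerDataNode() {
    return 2;
  }

  @Override
  public long rawCapacityPerDataNode() {
    return 2 * bytesPerVolume;
  }
}

/** Derives expected cluster-wide values from the layout instead of constants. */
final class ExpectedClusterValues {
  static int totalDataDirs(DatasetLayout layout, int numDataNodes) {
    return layout.numDataDirsPerDataNode() * numDataNodes;
  }

  static long totalRawCapacity(DatasetLayout layout, int numDataNodes) {
    return layout.rawCapacityPerDataNode() * numDataNodes;
  }
}
```

A test written against {{DatasetLayout}} keeps passing when a different FsDataset implementation reports, say, one volume per DataNode, because the expected totals are derived rather than fixed.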





[jira] [Commented] (HDFS-9331) Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem entirely allocated for DFS use

2015-10-29 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981671#comment-14981671
 ] 

Tony Wu commented on HDFS-9331:
---

The test failures are unrelated as the patch only changed 
{{TestNameNodeMXBean#testNameNodeMXBeanInfo}}, which passes.

> Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem 
> entirely allocated for DFS use
> ---
>
> Key: HDFS-9331
> URL: https://issues.apache.org/jira/browse/HDFS-9331
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9331.001.patch
>
>
> {{TestNameNodeMXBean#testNameNodeMXBeanInfo}} expects a non-zero nonDFS
> size. The nonDFS size is defined as:
> {quote}
> The space that is not used by HDFS. For instance, once you format a new disk
> to ext4, certain space is used for "lost-and-found" directory and ext4
> metadata.
> {quote}
> It will be possible to allocate all of the space in a filesystem for DFS use,
> in which case the nonDFS size will be zero. We can relax the check in the
> test to account for this case.
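
The relaxation described above amounts to moving from a strict to a non-strict lower bound. A minimal sketch, assuming the test currently requires a strictly positive value; {{NonDfsUsedCheck}} and its method names are hypothetical and do not come from the patch.

```java
/** Hypothetical helper contrasting the two checks; not code from the patch. */
final class NonDfsUsedCheck {
  /** Assumed current behaviour: nonDFS size must be strictly positive. */
  static boolean strict(long nonDfsUsedBytes) {
    return nonDfsUsedBytes > 0;
  }

  /** Relaxed behaviour: zero is valid when the filesystem is fully DFS-owned. */
  static boolean relaxed(long nonDfsUsedBytes) {
    return nonDfsUsedBytes >= 0;
  }
}
```

The relaxed form still rejects negative values, which would indicate an accounting error rather than a fully allocated filesystem.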





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-29 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981585#comment-14981585
 ] 

Tony Wu commented on HDFS-9236:
---

Thanks for clarifying. I'll post an updated patch shortly.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch, HDFS-9236.004.patch
>
>
> Ran into an issue while running tests against faulty DataNode code.
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>       List<BlockRecord> syncList) throws IOException {
>     …
>     // Calculate the best available replica state.
>     ReplicaState bestState = ReplicaState.RWR;
>     …
>     // Calculate list of nodes that will participate in the recovery
>     // and the new block size
>     List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
>     final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
>         -1, recoveryId);
>     switch(bestState) {
>     …
>     case RBW:
>     case RWR:
>       long minLength = Long.MAX_VALUE;
>       for(BlockRecord r : syncList) {
>         ReplicaState rState = r.rInfo.getOriginalReplicaState();
>         if(rState == bestState) {
>           minLength = Math.min(minLength, r.rInfo.getNumBytes());
>           participatingList.add(r);
>         }
>       }
>       newBlock.setNumBytes(minLength);
>       break;
>     …
>     }
>     …
>     nn.commitBlockSynchronization(block,
>         newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
>         datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above
> case, it is possible for none of the rState values (reported by DNs with
> copies of the replica being recovered) to match bestState. This can be caused
> either by faulty DN code or by stale/modified/corrupted files on a DN. When
> this happens, the loop never updates minLength, so the DN ends up reporting a
> minLength of Long.MAX_VALUE.
> Unfortunately, there is no check on the NN for replica length. See
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>       long newgenerationstamp, long newlength,
>       boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>       String[] newtargetstorages) throws IOException {
>     …
>     if (deleteblock) {
>       Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
>       boolean remove = iFile.removeLastBlock(blockToDel) != null;
>       if (remove) {
>         blockManager.removeBlock(storedBlock);
>       }
>     } else {
>       // update last block
>       if(!copyTruncate) {
>         storedBlock.setGenerationStamp(newgenerationstamp);
>
>         // XXX block length is updated without any check <<<
>         storedBlock.setNumBytes(newlength);
>       }
>       …
>     if (closeFile) {
>       LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>           + ", file=" + src
>           + (copyTruncate ? ", newBlock=" + truncatedBlock
>               : ", newgenerationstamp=" + newgenerationstamp)
>           + ", newlength=" + newlength
>           + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
>     } else {
>       LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
>     }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent
> block report (even with the correct length) will cause the block to be marked
> as corrupted. This block could be the last block of the file; if that is the
> case and the client goes away, the NN won’t be able to recover the lease and
> close the file because the last block is under-replicated.
> I believe we need a sanity check for block size on both the DN and the NN to
> prevent this case from happening.





[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-29 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Attachment: HDFS-9236.005.patch

In v5 patch:
* Addressed [~liuml07]'s comment by updating the test case.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch, HDFS-9236.004.patch, HDFS-9236.005.patch
>
>
> Ran into an issue while running tests against faulty DataNode code.
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>       List<BlockRecord> syncList) throws IOException {
>     …
>     // Calculate the best available replica state.
>     ReplicaState bestState = ReplicaState.RWR;
>     …
>     // Calculate list of nodes that will participate in the recovery
>     // and the new block size
>     List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
>     final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
>         -1, recoveryId);
>     switch(bestState) {
>     …
>     case RBW:
>     case RWR:
>       long minLength = Long.MAX_VALUE;
>       for(BlockRecord r : syncList) {
>         ReplicaState rState = r.rInfo.getOriginalReplicaState();
>         if(rState == bestState) {
>           minLength = Math.min(minLength, r.rInfo.getNumBytes());
>           participatingList.add(r);
>         }
>       }
>       newBlock.setNumBytes(minLength);
>       break;
>     …
>     }
>     …
>     nn.commitBlockSynchronization(block,
>         newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
>         datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above
> case, it is possible for none of the rState values (reported by DNs with
> copies of the replica being recovered) to match bestState. This can be caused
> either by faulty DN code or by stale/modified/corrupted files on a DN. When
> this happens, the loop never updates minLength, so the DN ends up reporting a
> minLength of Long.MAX_VALUE.
> Unfortunately, there is no check on the NN for replica length. See
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>       long newgenerationstamp, long newlength,
>       boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>       String[] newtargetstorages) throws IOException {
>     …
>     if (deleteblock) {
>       Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
>       boolean remove = iFile.removeLastBlock(blockToDel) != null;
>       if (remove) {
>         blockManager.removeBlock(storedBlock);
>       }
>     } else {
>       // update last block
>       if(!copyTruncate) {
>         storedBlock.setGenerationStamp(newgenerationstamp);
>
>         // XXX block length is updated without any check <<<
>         storedBlock.setNumBytes(newlength);
>       }
>       …
>     if (closeFile) {
>       LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>           + ", file=" + src
>           + (copyTruncate ? ", newBlock=" + truncatedBlock
>               : ", newgenerationstamp=" + newgenerationstamp)
>           + ", newlength=" + newlength
>           + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
>     } else {
>       LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
>     }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent
> block report (even with the correct length) will cause the block to be marked
> as corrupted. This block could be the last block of the file; if that is the
> case and the client goes away, the NN won’t be able to recover the lease and
> close the file because the last block is under-replicated.
> I believe we need a sanity check for block size on both the DN and the NN to
> prevent this case from happening.





[jira] [Updated] (HDFS-9331) Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem entirely allocated for DFS use

2015-10-28 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9331:
--
Status: Patch Available  (was: Open)

> Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem 
> entirely allocated for DFS use
> ---
>
> Key: HDFS-9331
> URL: https://issues.apache.org/jira/browse/HDFS-9331
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9331.001.patch
>
>
> {{TestNameNodeMXBean#testNameNodeMXBeanInfo}} expects a non-zero nonDFS
> size. The nonDFS size is defined as:
> {quote}
> The space that is not used by HDFS. For instance, once you format a new disk
> to ext4, certain space is used for "lost-and-found" directory and ext4
> metadata.
> {quote}
> It will be possible to allocate all of the space in a filesystem for DFS use,
> in which case the nonDFS size will be zero. We can relax the check in the
> test to account for this case.





[jira] [Updated] (HDFS-9331) Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem entirely allocated for DFS use

2015-10-28 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9331:
--
Attachment: HDFS-9331.001.patch

In v1 patch:
* Relax the check in {{TestNameNodeMXBean#testNameNodeMXBeanInfo}} to allow 
nonDFS size to be 0.

> Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem 
> entirely allocated for DFS use
> ---
>
> Key: HDFS-9331
> URL: https://issues.apache.org/jira/browse/HDFS-9331
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9331.001.patch
>
>
> {{TestNameNodeMXBean#testNameNodeMXBeanInfo}} expects a non-zero nonDFS
> size. The nonDFS size is defined as:
> {quote}
> The space that is not used by HDFS. For instance, once you format a new disk
> to ext4, certain space is used for "lost-and-found" directory and ext4
> metadata.
> {quote}
> It will be possible to allocate all of the space in a filesystem for DFS use,
> in which case the nonDFS size will be zero. We can relax the check in the
> test to account for this case.





[jira] [Created] (HDFS-9331) Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem entirely allocated for DFS use

2015-10-28 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9331:
-

 Summary: Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to 
account for filesystem entirely allocated for DFS use
 Key: HDFS-9331
 URL: https://issues.apache.org/jira/browse/HDFS-9331
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Trivial


{{TestNameNodeMXBean#testNameNodeMXBeanInfo}} expects a non-zero nonDFS size.
The nonDFS size is defined as:
{quote}
The space that is not used by HDFS. For instance, once you format a new disk to 
ext4, certain space is used for "lost-and-found" directory and ext4 metadata.
{quote}

It will be possible to allocate all of the space in a partition/volume for DFS
use, in which case the nonDFS size will be zero. We can relax the check in the
test to account for this case.





[jira] [Updated] (HDFS-9331) Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem entirely allocated for DFS use

2015-10-28 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9331:
--
Description: 
{{TestNameNodeMXBean#testNameNodeMXBeanInfo}} expects a non-zero nonDFS size.
The nonDFS size is defined as:
{quote}
The space that is not used by HDFS. For instance, once you format a new disk to 
ext4, certain space is used for "lost-and-found" directory and ext4 metadata.
{quote}

It will be possible to allocate all of the space in a filesystem for DFS use,
in which case the nonDFS size will be zero. We can relax the check in the test
to account for this case.

  was:
{{TestNameNodeMXBean#testNameNodeMXBeanInfo}} expects a non-zero nonDFS size.
The nonDFS size is defined as:
{quote}
The space that is not used by HDFS. For instance, once you format a new disk to 
ext4, certain space is used for "lost-and-found" directory and ext4 metadata.
{quote}

It will be possible to allocate all of the space in a partition/volume for DFS
use, in which case the nonDFS size will be zero. We can relax the check in the
test to account for this case.


> Modify TestNameNodeMXBean#testNameNodeMXBeanInfo() to account for filesystem 
> entirely allocated for DFS use
> ---
>
> Key: HDFS-9331
> URL: https://issues.apache.org/jira/browse/HDFS-9331
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
>
> {{TestNameNodeMXBean#testNameNodeMXBeanInfo}} expects a non-zero nonDFS
> size. The nonDFS size is defined as:
> {quote}
> The space that is not used by HDFS. For instance, once you format a new disk
> to ext4, certain space is used for "lost-and-found" directory and ext4
> metadata.
> {quote}
> It will be possible to allocate all of the space in a filesystem for DFS use,
> in which case the nonDFS size will be zero. We can relax the check in the
> test to account for this case.





[jira] [Updated] (HDFS-9308) Add truncateMeta() and deleteMeta() to MiniDFSCluster

2015-10-27 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9308:
--
Attachment: HDFS-9308.002.patch

* Add {{truncateMeta()}} and {{deleteMeta()}} methods to truncate/delete
metadata files on DataNodes.
* Modify {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} to use the new 
APIs.
* Enhance {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} to check the 
file size after lease recovery is complete. Truncating the metadata file on 
DataNodes will effectively reduce the file size after lease recovery. The test 
should verify the new file size as well.
* Modify {{TestCrcCorruption#thistest}} to use the new APIs.

> Add truncateMeta() and deleteMeta() to MiniDFSCluster
> -
>
> Key: HDFS-9308
> URL: https://issues.apache.org/jira/browse/HDFS-9308
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9308.001.patch, HDFS-9308.002.patch
>
>
> HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
> metadata file filesystem-agnostic. There should also be {{truncateMeta()}}
> and {{deleteMeta()}} methods in MiniDFSCluster to allow truncation and
> deletion of metadata files on DataNodes without writing code that's specific
> to the underlying file system. {{FsDatasetTestUtils#truncateMeta()}} is
> already implemented by HDFS-9188 and can be exposed easily in
> {{MiniDFSCluster}}.
> This will be useful for tests such as
> {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} and
> {{TestCrcCorruption#testCrcCorruption}}.
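
The idea of hiding the storage details behind {{truncateMeta()}} and {{deleteMeta()}} can be sketched as below. This is an illustration only: {{MetaFileOps}} and {{LocalMetaFileOps}} are hypothetical names, the signatures are assumptions rather than the MiniDFSCluster API, and the local-file implementation simply shows what one backing implementation might do.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical storage-agnostic contract a test would call. */
interface MetaFileOps {
  /** Shrinks the metadata file to newSize bytes. */
  void truncateMeta(long newSize) throws IOException;

  /** Removes the metadata file entirely. */
  void deleteMeta() throws IOException;
}

/** One possible backing implementation for a local-filesystem dataset. */
final class LocalMetaFileOps implements MetaFileOps {
  private final Path metaFile;

  LocalMetaFileOps(Path metaFile) {
    this.metaFile = metaFile;
  }

  @Override
  public void truncateMeta(long newSize) throws IOException {
    // FileChannel.truncate leaves the file unchanged if newSize >= current size.
    try (FileChannel ch = FileChannel.open(metaFile, StandardOpenOption.WRITE)) {
      ch.truncate(newSize);
    }
  }

  @Override
  public void deleteMeta() throws IOException {
    Files.delete(metaFile);
  }
}
```

A test that only sees {{MetaFileOps}} never touches file paths directly, so a dataset that stores metadata somewhere other than a local file could supply its own implementation without the test changing.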





[jira] [Updated] (HDFS-9308) Add truncateMeta() and deleteMeta() to MiniDFSCluster

2015-10-27 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9308:
--
Summary: Add truncateMeta() and deleteMeta() to MiniDFSCluster  (was: Add 
truncateMeta() to MiniDFSCluster)

> Add truncateMeta() and deleteMeta() to MiniDFSCluster
> -
>
> Key: HDFS-9308
> URL: https://issues.apache.org/jira/browse/HDFS-9308
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9308.001.patch
>
>
> HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
> metadata file filesystem-agnostic. There should also be a {{truncateMeta()}}
> method in MiniDFSCluster to allow truncation of metadata files on DataNodes
> without writing code that's specific to the underlying file system.
> {{FsDatasetTestUtils#truncateMeta()}} is already implemented by HDFS-9188 and
> can be exposed easily in {{MiniDFSCluster}}.
> This will be useful for tests such as
> {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.





[jira] [Updated] (HDFS-9308) Add truncateMeta() and deleteMeta() to MiniDFSCluster

2015-10-27 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9308:
--
Description: 
HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
metadata file filesystem-agnostic. There should also be {{truncateMeta()}} and
{{deleteMeta()}} methods in MiniDFSCluster to allow truncation and deletion of
metadata files on DataNodes without writing code that's specific to the
underlying file system. {{FsDatasetTestUtils#truncateMeta()}} is already
implemented by HDFS-9188 and can be exposed easily in {{MiniDFSCluster}}.

This will be useful for tests such as 
{{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} and 
{{TestCrcCorruption#testCrcCorruption}}.

  was:
HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
metadata file filesystem-agnostic. There should also be a {{truncateMeta()}}
method in MiniDFSCluster to allow truncation of metadata files on DataNodes
without writing code that's specific to the underlying file system.
{{FsDatasetTestUtils#truncateMeta()}} is already implemented by HDFS-9188 and
can be exposed easily in {{MiniDFSCluster}}.

This will be useful for tests such as 
{{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.


> Add truncateMeta() and deleteMeta() to MiniDFSCluster
> -
>
> Key: HDFS-9308
> URL: https://issues.apache.org/jira/browse/HDFS-9308
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9308.001.patch
>
>
> HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
> metadata file filesystem-agnostic. There should also be {{truncateMeta()}}
> and {{deleteMeta()}} methods in MiniDFSCluster to allow truncation and
> deletion of metadata files on DataNodes without writing code that's specific
> to the underlying file system. {{FsDatasetTestUtils#truncateMeta()}} is
> already implemented by HDFS-9188 and can be exposed easily in
> {{MiniDFSCluster}}.
> This will be useful for tests such as
> {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} and
> {{TestCrcCorruption#testCrcCorruption}}.





[jira] [Updated] (HDFS-9308) Add truncateMeta() to MiniDFSCluster

2015-10-26 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9308:
--
Attachment: HDFS-9308.001.patch

In this patch:
* Add {{truncateMeta()}} method to truncate metadata files on DataNodes.
* Modify {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} to use the new 
API.
* Enhance {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}} to check the 
file size after lease recovery is complete. Truncating the metadata file on 
DataNodes will effectively reduce the file size after lease recovery. The test 
should verify the new file size as well.

> Add truncateMeta() to MiniDFSCluster
> 
>
> Key: HDFS-9308
> URL: https://issues.apache.org/jira/browse/HDFS-9308
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9308.001.patch
>
>
> HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
> metadata file filesystem-agnostic. There should also be a {{truncateMeta()}}
> method to allow truncation of metadata files on DataNodes without writing
> code that's specific to the underlying file system.
> This will be useful for tests such as
> {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.





[jira] [Updated] (HDFS-9308) Add truncateMeta() to MiniDFSCluster

2015-10-26 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9308:
--
Status: Patch Available  (was: Open)

> Add truncateMeta() to MiniDFSCluster
> 
>
> Key: HDFS-9308
> URL: https://issues.apache.org/jira/browse/HDFS-9308
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9308.001.patch
>
>
> HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
> metadata file filesystem-agnostic. There should also be a {{truncateMeta()}}
> method to allow truncation of metadata files on DataNodes without writing
> code that's specific to the underlying file system.
> This will be useful for tests such as
> {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.





[jira] [Created] (HDFS-9308) Add truncateMeta() to MiniDFSCluster

2015-10-26 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9308:
-

 Summary: Add truncateMeta() to MiniDFSCluster
 Key: HDFS-9308
 URL: https://issues.apache.org/jira/browse/HDFS-9308
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Minor


HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
metadata file filesystem-agnostic. There should also be a {{truncateMeta()}}
method to allow truncation of metadata files on DataNodes without writing code
that's specific to the underlying file system.

This will be useful for tests such as 
{{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.





[jira] [Commented] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-26 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975193#comment-14975193
 ] 

Tony Wu commented on HDFS-9282:
---

Manually verified hadoop.hdfs.server.datanode.TestDirectoryScanner runs without 
error. The patch does not change anything related to this test.

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch, HDFS-9282.002.patch
>
>
> MiniDFSCluster and several tests have a hard-coded assumption that the
> underlying storage has 2 data directories (volumes). As HDFS-9188 pointed out,
> with new FsDataset implementations, these hard-coded assumptions about the
> number of data directories and raw capacities of storage may change as well.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Commented] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2015-10-26 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975231#comment-14975231
 ] 

Tony Wu commented on HDFS-9290:
---

Hi [~kihwal],
Thanks for taking the time to manually run these tests. I didn't know Hadoop QA 
does not kick off test runs for client changes. Will make sure I include my 
manual run results in the future.
Thanks,
Tony

> DFSClient#callAppend() is not backward compatible for slightly older NameNodes
> --
>
> Key: HDFS-9290
> URL: https://issues.apache.org/jira/browse/HDFS-9290
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Blocker
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9290.001.patch, HDFS-9290.002.patch
>
>
> HDFS-7210 combined 2 RPC calls used at file append into a single one.
> Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward
> compatibility for older clients is handled by the new NameNode (protobuf),
> a newer client's {{append()}} call does not work with older NameNodes. One
> will run into an exception like the following:
> {code:java}
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
>     at org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
>     at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1560)
>     at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1670)
>     at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
>     at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
>     at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
>     at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
>     at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
> {code}
> The cause is that the new client code expects both the last block and the
> file info in the same RPC, but the old NameNode replies with only the first.
> The exception itself does not reflect this, and one has to look at the
> HDFS source code to really understand what happened.
> We can have the client detect that it's talking to an old NameNode and send an
> extra {{getFileInfo()}} RPC. Or we should improve the exception being thrown
> to accurately reflect the cause of failure.
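
The first option above (detect the missing file info and fall back to a separate {{getFileInfo()}} call) might look roughly like this. It is a sketch under assumptions, not the attached patch: {{AppendCompat}}, {{AppendResult}}, {{FileStatus}} and {{NameNodeRpc}} are simplified hypothetical types, and "old NameNode" is modelled as an append reply whose status field is null.

```java
import java.io.IOException;

/** Hypothetical, simplified model of the append RPC; not the HDFS client API. */
final class AppendCompat {
  /** Stand-in for the file info returned by the NameNode. */
  static final class FileStatus {
    final String path;

    FileStatus(String path) {
      this.path = path;
    }
  }

  /** Combined reply: the status is null when an old NameNode omits it. */
  static final class AppendResult {
    final String lastBlock;
    final FileStatus status;

    AppendResult(String lastBlock, FileStatus status) {
      this.lastBlock = lastBlock;
      this.status = status;
    }
  }

  interface NameNodeRpc {
    AppendResult append(String src) throws IOException;

    FileStatus getFileInfo(String src) throws IOException;
  }

  /** Issues append(); if the status is missing, sends the extra RPC. */
  static AppendResult appendWithFallback(NameNodeRpc nn, String src)
      throws IOException {
    AppendResult result = nn.append(src);
    if (result.status != null) {
      return result; // new NameNode: both pieces arrived in one RPC
    }
    FileStatus status = nn.getFileInfo(src);
    if (status == null) {
      // Second option from the description: fail with a descriptive message.
      throw new IOException("NameNode returned no file status for " + src
          + "; it may predate the combined append RPC");
    }
    return new AppendResult(result.lastBlock, status);
  }
}
```

Either way the caller ends up with a non-null status or a descriptive IOException, rather than a NullPointerException several frames away from the actual cause.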





[jira] [Updated] (HDFS-9308) Add truncateMeta() to MiniDFSCluster

2015-10-26 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9308:
--
Description: 
HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the
metadata file filesystem-agnostic. There should also be a {{truncateMeta()}}
method in MiniDFSCluster to allow truncation of metadata files on DataNodes
without writing code that's specific to the underlying file system.
{{FsDatasetTestUtils#truncateMeta()}} is already implemented by HDFS-9188 and
can be exposed easily in {{MiniDFSCluster}}.

This will be useful for tests such as 
{{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.

  was:
HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the metadata 
file filesystem-agnostic. There should also be a {{truncateMeta()}} method to 
allow truncation of metadata files on DataNodes without writing code that's 
specific to the underlying file system. 

This will be useful for tests such as 
{{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.


> Add truncateMeta() to MiniDFSCluster
> 
>
> Key: HDFS-9308
> URL: https://issues.apache.org/jira/browse/HDFS-9308
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9308.001.patch
>
>
> HDFS-9188 introduced the {{corruptMeta()}} method to make corrupting the metadata 
> file filesystem-agnostic. There should also be a {{truncateMeta()}} method in 
> MiniDFSCluster to allow truncation of metadata files on DataNodes without 
> writing code that's specific to the underlying file system. 
> {{FsDatasetTestUtils#truncateMeta()}} is already implemented by HDFS-9188 and 
> can be exposed easily in {{MiniDFSCluster}}.
> This will be useful for tests such as 
> {{TestLeaseRecovery#testBlockRecoveryWithLessMetafile}}.
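
As a rough illustration of what "filesystem-agnostic" truncation could look like, the sketch below hides the storage details behind a small interface so that a test only asks for a new size. The names {{MetaFileTruncator}} and {{LocalMetaFile}} are hypothetical and are not part of HDFS; the real {{FsDatasetTestUtils#truncateMeta()}} signature may differ.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncateMetaSketch {
  /** Hypothetical test-utility interface; the real HDFS API may look different. */
  interface MetaFileTruncator {
    void truncateMeta(long newSize) throws IOException;
  }

  /** Local-file implementation: only this class knows the metadata is a plain file. */
  static final class LocalMetaFile implements MetaFileTruncator {
    private final File metaFile;

    LocalMetaFile(File metaFile) {
      this.metaFile = metaFile;
    }

    @Override
    public void truncateMeta(long newSize) throws IOException {
      try (RandomAccessFile raf = new RandomAccessFile(metaFile, "rw")) {
        // Refuse sizes that would grow the file or make no sense.
        if (newSize < 0 || newSize > raf.length()) {
          throw new IOException("Invalid truncation size " + newSize);
        }
        raf.setLength(newSize);
      }
    }
  }
}
```

A test written against the interface would not need to change if a different dataset implementation stored its metadata somewhere other than a local file.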





[jira] [Updated] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-26 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9282:
--
Attachment: HDFS-9282.002.patch

Addressed [~eddyxu]'s review comments.

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch, HDFS-9282.002.patch
>
>
> MiniDFSCluster and several tests have a hard-coded assumption that the underlying 
> storage has 2 data directories (volumes). As HDFS-9188 pointed out, with 
> new FsDataset implementations, these hard-coded assumptions about the number of 
> data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Commented] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-26 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974469#comment-14974469
 ] 

Tony Wu commented on HDFS-9282:
---

Hi [~eddyxu],

Thank you very much for the detailed review. I have addressed all of your 
comments in the new patch. Please take a look.

Regarding your comment:
* {{FsDatasetTestUtils#getNumOfDataDirs()}} should be renamed as 
{{getDefaultNumOfDataDirs()}}. I was thinking the method in TestUtils may not 
always return the default value. Instead it may choose to return some 
calculated value. But this is probably over-designing the API and I have 
changed the name as you suggested.

Thanks,
Tony

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch
>
>
> MiniDFSCluster and several tests have a hard-coded assumption that the underlying 
> storage has 2 data directories (volumes). As HDFS-9188 pointed out, with 
> new FsDataset implementations, these hard-coded assumptions about the number of 
> data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Commented] (HDFS-9297) Update TestBlockMissingException to use corruptBlockOnDataNodesByDeletingBlockFile()

2015-10-23 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971546#comment-14971546
 ] 

Tony Wu commented on HDFS-9297:
---

The failed tests are not related to this change.

> Update TestBlockMissingException to use 
> corruptBlockOnDataNodesByDeletingBlockFile()
> 
>
> Key: HDFS-9297
> URL: https://issues.apache.org/jira/browse/HDFS-9297
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9297.001.patch
>
>
> TestBlockMissingException uses its own function to corrupt a block by 
> deleting all its block files. HDFS-7235 introduced a helper function 
> {{corruptBlockOnDataNodesByDeletingBlockFile()}} that does exactly the same 
> thing. We can update this test to use the helper function.





[jira] [Updated] (HDFS-9297) Update TestBlockMissingException to use corruptBlockOnDataNodesByDeletingBlockFile()

2015-10-23 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9297:
--
Status: Patch Available  (was: Open)

> Update TestBlockMissingException to use 
> corruptBlockOnDataNodesByDeletingBlockFile()
> 
>
> Key: HDFS-9297
> URL: https://issues.apache.org/jira/browse/HDFS-9297
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9297.001.patch
>
>
> TestBlockMissingException uses its own function to corrupt a block by 
> deleting all its block files. HDFS-7235 introduced a helper function 
> {{corruptBlockOnDataNodesByDeletingBlockFile()}} that does exactly the same 
> thing. We can update this test to use the helper function.





[jira] [Updated] (HDFS-9297) Update TestBlockMissingException to use corruptBlockOnDataNodesByDeletingBlockFile()

2015-10-23 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9297:
--
Attachment: HDFS-9297.001.patch

In this patch:
* Use {{corruptBlockOnDataNodesByDeletingBlockFile()}} to corrupt a block by 
removing all block files.
* Removed the test's own implementation of the same function.

> Update TestBlockMissingException to use 
> corruptBlockOnDataNodesByDeletingBlockFile()
> 
>
> Key: HDFS-9297
> URL: https://issues.apache.org/jira/browse/HDFS-9297
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9297.001.patch
>
>
> TestBlockMissingException uses its own function to corrupt a block by 
> deleting all its block files. HDFS-7235 introduced a helper function 
> {{corruptBlockOnDataNodesByDeletingBlockFile()}} that does exactly the same 
> thing. We can update this test to use the helper function.





[jira] [Created] (HDFS-9297) Update TestBlockMissingException to use corruptBlockOnDataNodesByDeletingBlockFile()

2015-10-23 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9297:
-

 Summary: Update TestBlockMissingException to use 
corruptBlockOnDataNodesByDeletingBlockFile()
 Key: HDFS-9297
 URL: https://issues.apache.org/jira/browse/HDFS-9297
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Trivial


TestBlockMissingException uses its own function to corrupt a block by deleting 
all its block files. HDFS-7235 introduced a helper function 
{{corruptBlockOnDataNodesByDeletingBlockFile()}} that does exactly the same 
thing. We can update this test to use the helper function.





[jira] [Updated] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2015-10-23 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9290:
--
Attachment: HDFS-9290.002.patch

Hi [~kihwal],

Thanks a lot for taking the time to look at the patch! I have addressed your 
comment by changing the log level. Please take a look and let me know if you 
have any other comments.

Regards,
Tony Wu

> DFSClient#callAppend() is not backward compatible for slightly older NameNodes
> --
>
> Key: HDFS-9290
> URL: https://issues.apache.org/jira/browse/HDFS-9290
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Blocker
> Attachments: HDFS-9290.001.patch, HDFS-9290.002.patch
>
>
> HDFS-7210 combined 2 RPC calls used at file append into a single one. 
> Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
> compatibility for older clients is handled by the new NameNode (protobuf), a 
> newer client's {{append()}} call does not work with older NameNodes. One will 
> run into an exception like the following:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1560)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1670)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
> at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
> at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
> {code}
> The cause is that the new client code is expecting both the last block and 
> file info in the same RPC but the old NameNode only replied with the first. 
> The exception itself does not reflect this and one will have to look at the 
> HDFS source code to really understand what happened.
> We can have the client detect it's talking to an old NameNode and send an 
> extra {{getFileInfo()}} RPC. Or we should improve the exception being thrown 
> to accurately reflect the cause of failure.





[jira] [Updated] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2015-10-22 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9290:
--
Status: Patch Available  (was: Open)

> DFSClient#callAppend() is not backward compatible for slightly older NameNodes
> --
>
> Key: HDFS-9290
> URL: https://issues.apache.org/jira/browse/HDFS-9290
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9290.001.patch
>
>
> HDFS-7210 combined 2 RPC calls used at file append into a single one. 
> Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
> compatibility for older clients is handled by the new NameNode (protobuf), a 
> newer client's {{append()}} call does not work with older NameNodes. One will 
> run into an exception like the following:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1560)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1670)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
> at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
> at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
> {code}
> The cause is that the new client code is expecting both the last block and 
> file info in the same RPC but the old NameNode only replied with the first. 
> The exception itself does not reflect this cause and one will have to look at 
> the HDFS source code to really understand what happened.
> We can have the client detect it's talking to an old NameNode and send an 
> extra {{getFileInfo()}} RPC. At the very least we can improve the exception 
> being thrown to accurately reflect the failure.





[jira] [Created] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2015-10-22 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9290:
-

 Summary: DFSClient#callAppend() is not backward compatible for 
slightly older NameNodes
 Key: HDFS-9290
 URL: https://issues.apache.org/jira/browse/HDFS-9290
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Minor


HDFS-7210 combined 2 RPC calls used at file append into a single one. 
Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
compatibility for older clients is handled by the new NameNode (protobuf), a newer 
client's {{append()}} call does not work with older NameNodes. One will run 
into an exception like the following:
{code:java}
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
at 
org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
at 
org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1560)
at 
org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1670)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
{code}

The cause is that the new client code is expecting both the last block and file 
info in the same RPC but the old NameNode only replied with the first. The 
exception itself does not reflect this cause and one will have to look at the 
HDFS source code to really understand what happened.

We can have the client detect it's talking to an old NameNode and send an extra 
{{getFileInfo()}} RPC. At the very least we can improve the exception being 
thrown to accurately reflect the failure.





[jira] [Updated] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2015-10-22 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9290:
--
Attachment: HDFS-9290.001.patch

In this patch:
* Detect that DFSClient is talking to an older NameNode and make an extra 
{{getFileInfo()}} RPC call to request file info.
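
The idea can be sketched roughly as follows. This is not the attached patch: {{NameNodeRpc}}, {{AppendResult}} and {{FileStatusInfo}} are hypothetical stand-ins for the real protocol classes, and the sketch assumes an older NameNode simply leaves the file-info part of the combined response unset (null).

```java
import java.io.IOException;

public class AppendFallbackSketch {
  /** Hypothetical stand-in for the file info returned by the NameNode. */
  static final class FileStatusInfo {
    final String path;

    FileStatusInfo(String path) {
      this.path = path;
    }
  }

  /** Hypothetical stand-in for the combined append() response. */
  static final class AppendResult {
    final String lastBlock;          // returned by old and new NameNodes
    final FileStatusInfo fileStatus; // assumed null when the NameNode is older

    AppendResult(String lastBlock, FileStatusInfo fileStatus) {
      this.lastBlock = lastBlock;
      this.fileStatus = fileStatus;
    }
  }

  /** Hypothetical slice of the client-to-NameNode protocol. */
  interface NameNodeRpc {
    AppendResult append(String src) throws IOException;

    FileStatusInfo getFileInfo(String src) throws IOException;
  }

  static AppendResult callAppend(NameNodeRpc namenode, String src) throws IOException {
    AppendResult result = namenode.append(src);
    if (result.fileStatus != null) {
      return result; // new NameNode: everything arrived in one RPC
    }
    // Older NameNode: fall back to a second RPC for the file info.
    FileStatusInfo status = namenode.getFileInfo(src);
    if (status == null) {
      // Fail with a descriptive message instead of a later NullPointerException.
      throw new IOException("No file info returned for " + src);
    }
    return new AppendResult(result.lastBlock, status);
  }
}
```

The extra RPC is only paid when the combined response is incomplete, so clients talking to a new NameNode keep the single-RPC behaviour.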

> DFSClient#callAppend() is not backward compatible for slightly older NameNodes
> --
>
> Key: HDFS-9290
> URL: https://issues.apache.org/jira/browse/HDFS-9290
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9290.001.patch
>
>
> HDFS-7210 combined 2 RPC calls used at file append into a single one. 
> Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
> compatibility for older clients is handled by the new NameNode (protobuf), a 
> newer client's {{append()}} call does not work with older NameNodes. One will 
> run into an exception like the following:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1560)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1670)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
> at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
> at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
> {code}
> The cause is that the new client code is expecting both the last block and 
> file info in the same RPC but the old NameNode only replied with the first. 
> The exception itself does not reflect this cause and one will have to look at 
> the HDFS source code to really understand what happened.
> We can have the client detect it's talking to an old NameNode and send an 
> extra {{getFileInfo()}} RPC. At the very least we can improve the exception 
> being thrown to accurately reflect the failure.





[jira] [Updated] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2015-10-22 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9290:
--
Description: 
HDFS-7210 combined 2 RPC calls used at file append into a single one. 
Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
compatibility for older clients is handled by the new NameNode (protobuf), a newer 
client's {{append()}} call does not work with older NameNodes. One will run 
into an exception like the following:
{code:java}
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
at 
org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
at 
org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1560)
at 
org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1670)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
{code}

The cause is that the new client code is expecting both the last block and file 
info in the same RPC but the old NameNode only replied with the first. The 
exception itself does not reflect this and one will have to look at the HDFS 
source code to really understand what happened.

We can have the client detect it's talking to an old NameNode and send an extra 
{{getFileInfo()}} RPC. Or we should improve the exception being thrown to 
accurately reflect the cause of failure.

  was:
HDFS-7210 combined 2 RPC calls used at file append into a single one. 
Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
compatibility for older clients is handled by the new NameNode (protobuf), a newer 
client's {{append()}} call does not work with older NameNodes. One will run 
into an exception like the following:
{code:java}
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
at 
org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
at 
org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1560)
at 
org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1670)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
{code}

The cause is that the new client code is expecting both the last block and file 
info in the same RPC but the old NameNode only replied with the first. The 
exception itself does not reflect this cause and one will have to look at the 
HDFS source code to really understand what happened.

We can have the client detect it's talking to an old NameNode and send an extra 
{{getFileInfo()}} RPC. At the very least we can improve the exception being 
thrown to accurately reflect the failure.


> DFSClient#callAppend() is not backward compatible for slightly older NameNodes
> --
>
> Key: HDFS-9290
> URL: https://issues.apache.org/jira/browse/HDFS-9290
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9290.001.patch
>
>
> HDFS-7210 combined 2 RPC calls used at file append into a single one. 
> Specifically 

[jira] [Commented] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-22 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969354#comment-14969354
 ] 

Tony Wu commented on HDFS-9282:
---

Manually reran the failed tests and they both pass without error.
Inspected the patch and did not find any newly added trailing spaces.

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch
>
>
> MiniDFSCluster and several tests have a hard-coded assumption that the underlying 
> storage has 2 data directories (volumes). As HDFS-9188 pointed out, with 
> new FsDataset implementations, these hard-coded assumptions about the number of 
> data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-21 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14967513#comment-14967513
 ] 

Tony Wu commented on HDFS-9236:
---

Hi [~yzhangal],

Could you take another look at the updated patch?

Thanks,
Tony

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch
>
>
> Ran into an issue while running tests against faulty DataNode code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>                  List<BlockRecord> syncList) throws IOException {
>     …
>     // Calculate the best available replica state.
>     ReplicaState bestState = ReplicaState.RWR;
>     …
>     // Calculate list of nodes that will participate in the recovery
>     // and the new block size
>     List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
>     final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
>         -1, recoveryId);
>     switch(bestState) {
>     …
>     case RBW:
>     case RWR:
>       long minLength = Long.MAX_VALUE;
>       for(BlockRecord r : syncList) {
>         ReplicaState rState = r.rInfo.getOriginalReplicaState();
>         if(rState == bestState) {
>           minLength = Math.min(minLength, r.rInfo.getNumBytes());
>           participatingList.add(r);
>         }
>       }
>       newBlock.setNumBytes(minLength);
>       break;
>     …
>     }
>     …
>     nn.commitBlockSynchronization(block,
>         newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
>         datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState values (reported by DNs with copies 
> of the replica being recovered) to match the bestState. This can be caused 
> either by faulty DN code or by stale/modified/corrupted files on a DN. When this 
> happens, the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>       long newgenerationstamp, long newlength,
>       boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>       String[] newtargetstorages) throws IOException {
>     …
>       if (deleteblock) {
>         Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
>         boolean remove = iFile.removeLastBlock(blockToDel) != null;
>         if (remove) {
>           blockManager.removeBlock(storedBlock);
>         }
>       } else {
>         // update last block
>         if(!copyTruncate) {
>           storedBlock.setGenerationStamp(newgenerationstamp);
>
>           // XXX block length is updated without any check <<<
>           storedBlock.setNumBytes(newlength);
>         }
>         …
>     if (closeFile) {
>       LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>           + ", file=" + src
>           + (copyTruncate ? ", newBlock=" + truncatedBlock
>               : ", newgenerationstamp=" + newgenerationstamp)
>           + ", newlength=" + newlength
>           + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
>     } else {
>       LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
>     }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with the correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.
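
A minimal sketch of the two checks proposed above, using simplified hypothetical types ({{Replica}}, {{minLengthOf}}, {{checkNewLength}}) rather than the real DataNode and FSNamesystem classes. The upper bound used on the NameNode side (the file's preferred block size) is an assumption made for illustration, not something stated in the issue.

```java
import java.io.IOException;
import java.util.List;

public class BlockLengthSanityCheckSketch {
  enum ReplicaState { FINALIZED, RBW, RWR }

  /** Simplified stand-in for a replica reported by a DataNode. */
  static final class Replica {
    final ReplicaState state;
    final long numBytes;

    Replica(ReplicaState state, long numBytes) {
      this.state = state;
      this.numBytes = numBytes;
    }
  }

  /** DN side: fail if no replica matches bestState instead of returning Long.MAX_VALUE. */
  static long minLengthOf(List<Replica> syncList, ReplicaState bestState) throws IOException {
    long minLength = Long.MAX_VALUE;
    boolean matched = false;
    for (Replica r : syncList) {
      if (r.state == bestState) {
        minLength = Math.min(minLength, r.numBytes);
        matched = true;
      }
    }
    if (!matched) {
      throw new IOException("No replica in state " + bestState + "; cannot pick a block length");
    }
    return minLength;
  }

  /** NN side: reject lengths outside [0, preferredBlockSize] (bound is an assumption). */
  static void checkNewLength(long newLength, long preferredBlockSize) throws IOException {
    if (newLength < 0 || newLength > preferredBlockSize) {
      throw new IOException("Rejecting block length " + newLength
          + " (preferred block size " + preferredBlockSize + ")");
    }
  }
}
```

With both checks in place, a mismatch between the reported replica states and bestState would surface as an explicit error during recovery rather than as a corrupt last block later on.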





[jira] [Updated] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-21 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9282:
--
Status: Patch Available  (was: Open)

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch
>
>
> MiniDFSCluster and several tests have a hard-coded assumption that the underlying 
> storage has 2 data directories (volumes). As HDFS-9188 pointed out, with 
> new FsDataset implementations, these hard-coded assumptions about the number of 
> data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Updated] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-21 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9282:
--
Attachment: HDFS-9282.001.patch

In this patch:
* Add getNumOfDataDirs() and getRawCapacity() interfaces.
* Have MiniDFSCluster automatically pick up the correct number of data 
directories.
* Updated a few tests where we have hard coded values.
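
As a rough illustration of the first two bullets, the accessors could take a shape like the following. Only the method names getNumOfDataDirs() and getRawCapacity() come from the patch notes above; the interface name, the return types, and the FixedDataset and ClusterExpectations helpers are assumptions made for this sketch and are not taken from the patch itself.

```java
/** Hypothetical sketch; only the two method names come from the patch notes. */
interface DatasetTestInfo {
  /** Number of data directories (volumes) per DataNode. */
  int getNumOfDataDirs();

  /** Raw storage capacity per DataNode, in bytes. */
  long getRawCapacity();
}

/** A fixed-value implementation standing in for one FsDataset flavor. */
final class FixedDataset implements DatasetTestInfo {
  private final int numDirs;
  private final long rawCapacity;

  FixedDataset(int numDirs, long rawCapacity) {
    this.numDirs = numDirs;
    this.rawCapacity = rawCapacity;
  }

  @Override
  public int getNumOfDataDirs() {
    return numDirs;
  }

  @Override
  public long getRawCapacity() {
    return rawCapacity;
  }
}

/** Test-side helpers that derive expectations instead of hard coding them. */
final class ClusterExpectations {
  /** Total volumes a test should expect across the cluster. */
  static int totalDataDirs(DatasetTestInfo info, int numDataNodes) {
    return info.getNumOfDataDirs() * numDataNodes;
  }

  /** Total raw capacity a test should expect across the cluster, in bytes. */
  static long totalRawCapacity(DatasetTestInfo info, int numDataNodes) {
    return info.getRawCapacity() * numDataNodes;
  }
}
```

A test written this way asks the dataset for its directory count rather than assuming 2 per DataNode, so the same assertion works for any FsDataset implementation.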

> Make data directory count and storage raw capacity related tests 
> FsDataset-agnostic
> ---
>
> Key: HDFS-9282
> URL: https://issues.apache.org/jira/browse/HDFS-9282
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Minor
> Attachments: HDFS-9282.001.patch
>
>
> MiniDFSCluster and several tests have a hard-coded assumption that the 
> underlying storage has 2 data directories (volumes). As HDFS-9188 pointed out, 
> with new FsDataset implementations, these hard-coded assumptions about the 
> number of data directories and raw capacities of storage may no longer hold.
> We need to extend FsDatasetTestUtils to provide:
> * Number of data directories of underlying storage per DataNode
> * Raw storage capacity of underlying storage per DataNode.
> * Have MiniDFSCluster automatically pick up the correct values.





[jira] [Created] (HDFS-9282) Make data directory count and storage raw capacity related tests FsDataset-agnostic

2015-10-21 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9282:
-

 Summary: Make data directory count and storage raw capacity 
related tests FsDataset-agnostic
 Key: HDFS-9282
 URL: https://issues.apache.org/jira/browse/HDFS-9282
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: HDFS, test
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Minor


MiniDFSCluster and several tests have a hard-coded assumption that the 
underlying storage has 2 data directories (volumes). As HDFS-9188 pointed out, 
with new FsDataset implementations, these hard-coded assumptions about the 
number of data directories and raw capacities of storage may no longer hold.

We need to extend FsDatasetTestUtils to provide:
* Number of data directories of underlying storage per DataNode
* Raw storage capacity of underlying storage per DataNode.
* Have MiniDFSCluster automatically pick up the correct values.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-16 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961543#comment-14961543
 ] 

Tony Wu commented on HDFS-9236:
---

The checkstyle and pre-patch errors are not related to this patch.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease 
> and close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-15 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Attachment: HDFS-9236.002.patch

Removed block size check on NN.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease 
> and close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-15 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959008#comment-14959008
 ] 

Tony Wu commented on HDFS-9236:
---

Thanks to [~yzhangal] for offline review and valuable comments! In summary:
* It is difficult to come up with a block size limit to enforce on the NN, 
especially considering that HDFS allows different files to specify their own 
block size.
** I will remove the NN-side change in the next patch. I would still like to 
investigate whether we can enforce a per-file block size check.
* The sanity check on the DN is useful, although the chance of hitting the error 
in a production cluster is small.




> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease 
> and close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-15 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959253#comment-14959253
 ] 

Tony Wu commented on HDFS-9236:
---

Hi [~yzhangal],

Thanks a lot for looking at the patch. Regarding your comments:
1: This has already been addressed in patch 2.
2 - 4: I will address these in the next patch.

Regards,
Tony 

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease 
> and close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-15 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Attachment: HDFS-9236.003.patch

Addressed [~yzhangal]'s review comments.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch, HDFS-9236.002.patch, 
> HDFS-9236.003.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease 
> and close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Commented] (HDFS-9238) Update TestFileCreation#testLeaseExpireHardLimit() to avoid using DataNodeTestUtils#getFile()

2015-10-14 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957068#comment-14957068
 ] 

Tony Wu commented on HDFS-9238:
---

The failed tests are not related to this patch as the patch itself only touches 
a particular test which did not fail.

> Update TestFileCreation#testLeaseExpireHardLimit() to avoid using 
> DataNodeTestUtils#getFile()
> -
>
> Key: HDFS-9238
> URL: https://issues.apache.org/jira/browse/HDFS-9238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9238.001.patch
>
>
> TestFileCreation#testLeaseExpireHardLimit uses DataNodeTestUtils#getFile() to 
> open, read and verify blocks written on the DN. It’s better to use 
> getBlockInputStream() which does exactly the same thing but hides the detail 
> of getting the block file on disk.





[jira] [Created] (HDFS-9236) Add sanity check for block size during block recovery

2015-10-13 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9236:
-

 Summary: Add sanity check for block size during block recovery
 Key: HDFS-9236
 URL: https://issues.apache.org/jira/browse/HDFS-9236
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu


Ran into an issue while running test against faulty data-node code. 

Currently in DataNode.java:
{code:java}
  /** Block synchronization */
  void syncBlock(RecoveringBlock rBlock,
 List<BlockRecord> syncList) throws IOException {
…

// Calculate the best available replica state.
ReplicaState bestState = ReplicaState.RWR;
…

// Calculate list of nodes that will participate in the recovery
// and the new block size
List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
-1, recoveryId);
switch(bestState) {
…
case RBW:
case RWR:
  long minLength = Long.MAX_VALUE;
  for(BlockRecord r : syncList) {
ReplicaState rState = r.rInfo.getOriginalReplicaState();
if(rState == bestState) {
  minLength = Math.min(minLength, r.rInfo.getNumBytes());
  participatingList.add(r);
}
  }
  newBlock.setNumBytes(minLength);
  break;
…
}
…
nn.commitBlockSynchronization(block,
newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
datanodes, storages);
  }
{code}

This code is called by the DN coordinating the block recovery. In the above 
case, it is possible for none of the rState (reported by DNs with copies of the 
replica being recovered) to match the bestState. This can either be caused by 
faulty DN code or stale/modified/corrupted files on DN. When this happens the 
DN will end up reporting a minLength of Long.MAX_VALUE.
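
To make the failure mode concrete, here is a minimal standalone sketch of the loop above. It is not the DataNode code: the Replica class and the minLength() helper are hypothetical stand-ins for BlockRecord and the body of the RBW/RWR case, and they keep only the replica state and length.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone illustration only; Replica and minLength() are hypothetical stand-ins. */
final class MinLengthDemo {
  enum ReplicaState { FINALIZED, RBW, RWR }

  /** Reduced form of a block record: original replica state plus length. */
  static final class Replica {
    final ReplicaState state;
    final long numBytes;

    Replica(ReplicaState state, long numBytes) {
      this.state = state;
      this.numBytes = numBytes;
    }
  }

  /** Same shape as the RBW/RWR case: minimum length over replicas in bestState. */
  static long minLength(List<Replica> syncList, ReplicaState bestState) {
    long minLength = Long.MAX_VALUE;
    for (Replica r : syncList) {
      if (r.state == bestState) {
        minLength = Math.min(minLength, r.numBytes);
      }
    }
    // Still Long.MAX_VALUE if no replica was in bestState.
    return minLength;
  }

  public static void main(String[] args) {
    List<Replica> matching = Arrays.asList(
        new Replica(ReplicaState.RWR, 4096L),
        new Replica(ReplicaState.RWR, 1024L));
    List<Replica> noneMatching = Arrays.asList(
        new Replica(ReplicaState.FINALIZED, 4096L));

    // Normal case: the shortest matching replica wins.
    System.out.println(minLength(matching, ReplicaState.RWR));
    // Faulty case: nothing matches, so the initial value leaks out.
    System.out.println(minLength(noneMatching, ReplicaState.RWR) == Long.MAX_VALUE);
  }
}
```

In the second call no replica is in the expected state, so the loop body never runs and the initial Long.MAX_VALUE is what would be passed on to the NN.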

Unfortunately there is no check on the NN for replica length. See 
FSNamesystem.java:
{code:java}
  void commitBlockSynchronization(ExtendedBlock oldBlock,
  long newgenerationstamp, long newlength,
  boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
  String[] newtargetstorages) throws IOException {
…

  if (deleteblock) {
Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
boolean remove = iFile.removeLastBlock(blockToDel) != null;
if (remove) {
  blockManager.removeBlock(storedBlock);
}
  } else {
// update last block
if(!copyTruncate) {
  storedBlock.setGenerationStamp(newgenerationstamp);
  
  // XXX block length is updated without any check <<<

[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-13 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Summary: Missing sanity check for block size during block recovery  (was: 
Add sanity check for block size during block recovery)

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, the NN won’t be able to recover the lease 
> and close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-13 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Attachment: HDFS-9236.001.patch

First patch.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch
>
>
> Ran into an issue while running test against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting a minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.
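
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual DataNode code: `MinLengthSketch`, `Replica` and the local `ReplicaState` enum are hypothetical stand-ins for the HDFS types, and only the min-length loop from the RBW/RWR branch is mirrored.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in types; these are not the real HDFS classes. */
final class MinLengthSketch {
  enum ReplicaState { FINALIZED, RBW, RWR }

  /** Simplified replica record: original state plus reported length. */
  static final class Replica {
    final ReplicaState state;
    final long numBytes;

    Replica(ReplicaState state, long numBytes) {
      this.state = state;
      this.numBytes = numBytes;
    }
  }

  /** Mirrors the loop in the RBW/RWR branch: minimum length over replicas in bestState. */
  static long minLength(List<Replica> syncList, ReplicaState bestState) {
    long minLength = Long.MAX_VALUE;
    for (Replica r : syncList) {
      if (r.state == bestState) {
        minLength = Math.min(minLength, r.numBytes);
      }
    }
    // If no replica matched bestState, the initial Long.MAX_VALUE is returned unchanged.
    return minLength;
  }

  public static void main(String[] args) {
    List<Replica> matching = Arrays.asList(
        new Replica(ReplicaState.RBW, 1024L), new Replica(ReplicaState.RBW, 512L));
    System.out.println(minLength(matching, ReplicaState.RBW));

    List<Replica> mismatched = Arrays.asList(new Replica(ReplicaState.FINALIZED, 1024L));
    System.out.println(minLength(mismatched, ReplicaState.RWR) == Long.MAX_VALUE);
  }
}
```

When every reported state differs from bestState, nothing lowers the initial value, so the length that would be sent to the NN is Long.MAX_VALUE.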



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-13 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955280#comment-14955280
 ] 

Tony Wu commented on HDFS-9236:
---

The patch does:
* Add replica length check in syncBlock() so DN reports error instead of 
sending Long.MAX_VALUE to NN.
* Add replica length check on NN so it won't blindly update the replica length 
to a value larger than configured block size.
* Add extra debug logs to help trace the block recovery process.
* Add unit tests to verify the new exceptions.
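
The two length checks listed above could look roughly like the sketch below. This is an illustration under stated assumptions, not the contents of HDFS-9236.001.patch: the class name, method names, parameters and exception messages are all hypothetical, and the real patch may structure the checks differently.

```java
import java.io.IOException;

/** Hypothetical helper; names and messages are illustrative only. */
final class BlockLengthChecks {

  /**
   * DN-side check (assumed shape): fail instead of reporting Long.MAX_VALUE
   * when no replica matched the best state.
   */
  static long checkedRecoveryLength(long minLength, int participants) throws IOException {
    if (participants == 0 || minLength == Long.MAX_VALUE) {
      throw new IOException("No replica matched the best state; refusing to report length "
          + minLength);
    }
    return minLength;
  }

  /**
   * NN-side check (assumed shape): reject a new length larger than the
   * configured block size before updating the stored block.
   */
  static void checkCommitLength(long newLength, long configuredBlockSize) throws IOException {
    if (newLength > configuredBlockSize) {
      throw new IOException("New length " + newLength
          + " exceeds configured block size " + configuredBlockSize);
    }
  }
}
```

Checking on both sides means a faulty DN is stopped early, while the NN still protects itself against a DN that skips the check.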

I tested the patch with:
* org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery
* org.apache.hadoop.hdfs.server.namenode.TestCommitBlockSynchronization
* org.apache.hadoop.hdfs.TestLeaseRecovery
* org.apache.hadoop.hdfs.TestLeaseRecovery2
* org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover: This test, 
especially the test case testPipelineRecoveryStress, is a good system test that 
stresses all parts of the lease/block recovery code path.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch
>
>
> Ran into an issue while running tests against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting the minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Commented] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-13 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955818#comment-14955818
 ] 

Tony Wu commented on HDFS-9236:
---

All tests pass when manually run on OSX and Linux (CentOS 6.4) with latest 
trunk. It looks like the failures are not caused by this patch.

checkstyle seems to be complaining about files and functions being too long. 
Those warnings are also not caused by this patch.

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch
>
>
> Ran into an issue while running tests against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting the minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Created] (HDFS-9238) Update TestFileCreation#testLeaseExpireHardLimit() to avoid using DataNodeTestUtils#getFile()

2015-10-13 Thread Tony Wu (JIRA)
Tony Wu created HDFS-9238:
-

 Summary: Update TestFileCreation#testLeaseExpireHardLimit() to 
avoid using DataNodeTestUtils#getFile()
 Key: HDFS-9238
 URL: https://issues.apache.org/jira/browse/HDFS-9238
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.1
Reporter: Tony Wu
Assignee: Tony Wu
Priority: Trivial


TestFileCreation#testLeaseExpireHardLimit uses DataNodeTestUtils#getFile() to 
open, read and verify blocks written on the DN. It’s better to use 
getBlockInputStream() which does exactly the same thing but hides the detail of 
getting the block file on disk.
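
The idea of verifying block contents through a stream rather than a file on disk can be sketched as follows. This is a minimal illustration, not the test change itself: `BlockStreamSketch` and the `BlockSource` interface are hypothetical stand-ins, and the method on it only borrows the getBlockInputStream() name mentioned above; its parameters here are assumptions.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch; not the actual TestFileCreation or dataset code. */
final class BlockStreamSketch {

  /** Stand-in for a dataset that can open a block as a stream (signature assumed). */
  interface BlockSource {
    InputStream getBlockInputStream(long blockId, long seekOffset) throws IOException;
  }

  /** Returns true if the block's bytes match expected exactly, with no trailing data. */
  static boolean verifyBlock(BlockSource source, long blockId, byte[] expected)
      throws IOException {
    try (InputStream in = source.getBlockInputStream(blockId, 0L)) {
      for (byte b : expected) {
        // read() returns 0..255 or -1 at end of stream, so mask the signed byte.
        if (in.read() != (b & 0xff)) {
          return false;
        }
      }
      return in.read() == -1;
    }
  }
}
```

The caller never learns where the block file lives, so the check keeps working if the on-disk layout changes.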





[jira] [Updated] (HDFS-9238) Update TestFileCreation#testLeaseExpireHardLimit() to avoid using DataNodeTestUtils#getFile()

2015-10-13 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9238:
--
Attachment: HDFS-9238.001.patch

Use getBlockInputStream() to read the block on the DN instead of using getFile() 
and manually reading from the file.

> Update TestFileCreation#testLeaseExpireHardLimit() to avoid using 
> DataNodeTestUtils#getFile()
> -
>
> Key: HDFS-9238
> URL: https://issues.apache.org/jira/browse/HDFS-9238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9238.001.patch
>
>
> TestFileCreation#testLeaseExpireHardLimit uses DataNodeTestUtils#getFile() to 
> open, read and verify blocks written on the DN. It’s better to use 
> getBlockInputStream() which does exactly the same thing but hides the detail 
> of getting the block file on disk.





[jira] [Updated] (HDFS-9238) Update TestFileCreation#testLeaseExpireHardLimit() to avoid using DataNodeTestUtils#getFile()

2015-10-13 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9238:
--
Status: Patch Available  (was: In Progress)

> Update TestFileCreation#testLeaseExpireHardLimit() to avoid using 
> DataNodeTestUtils#getFile()
> -
>
> Key: HDFS-9238
> URL: https://issues.apache.org/jira/browse/HDFS-9238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9238.001.patch
>
>
> TestFileCreation#testLeaseExpireHardLimit uses DataNodeTestUtils#getFile() to 
> open, read and verify blocks written on the DN. It’s better to use 
> getBlockInputStream() which does exactly the same thing but hides the detail 
> of getting the block file on disk.





[jira] [Work started] (HDFS-9238) Update TestFileCreation#testLeaseExpireHardLimit() to avoid using DataNodeTestUtils#getFile()

2015-10-13 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-9238 started by Tony Wu.
-
> Update TestFileCreation#testLeaseExpireHardLimit() to avoid using 
> DataNodeTestUtils#getFile()
> -
>
> Key: HDFS-9238
> URL: https://issues.apache.org/jira/browse/HDFS-9238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9238.001.patch
>
>
> TestFileCreation#testLeaseExpireHardLimit uses DataNodeTestUtils#getFile() to 
> open, read and verify blocks written on the DN. It’s better to use 
> getBlockInputStream() which does exactly the same thing but hides the detail 
> of getting the block file on disk.





[jira] [Updated] (HDFS-9238) Update TestFileCreation#testLeaseExpireHardLimit() to avoid using DataNodeTestUtils#getFile()

2015-10-13 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9238:
--
Component/s: test
 HDFS

> Update TestFileCreation#testLeaseExpireHardLimit() to avoid using 
> DataNodeTestUtils#getFile()
> -
>
> Key: HDFS-9238
> URL: https://issues.apache.org/jira/browse/HDFS-9238
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: HDFS, test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: HDFS-9238.001.patch
>
>
> TestFileCreation#testLeaseExpireHardLimit uses DataNodeTestUtils#getFile() to 
> open, read and verify blocks written on the DN. It’s better to use 
> getBlockInputStream() which does exactly the same thing but hides the detail 
> of getting the block file on disk.





[jira] [Updated] (HDFS-9236) Missing sanity check for block size during block recovery

2015-10-13 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9236:
--
Component/s: HDFS

> Missing sanity check for block size during block recovery
> -
>
> Key: HDFS-9236
> URL: https://issues.apache.org/jira/browse/HDFS-9236
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
> Attachments: HDFS-9236.001.patch
>
>
> Ran into an issue while running tests against faulty data-node code. 
> Currently in DataNode.java:
> {code:java}
>   /** Block synchronization */
>   void syncBlock(RecoveringBlock rBlock,
>  List<BlockRecord> syncList) throws IOException {
> …
> // Calculate the best available replica state.
> ReplicaState bestState = ReplicaState.RWR;
> …
> // Calculate list of nodes that will participate in the recovery
> // and the new block size
> List<BlockRecord> participatingList = new ArrayList<BlockRecord>();
> final ExtendedBlock newBlock = new ExtendedBlock(bpid, blockId,
> -1, recoveryId);
> switch(bestState) {
> …
> case RBW:
> case RWR:
>   long minLength = Long.MAX_VALUE;
>   for(BlockRecord r : syncList) {
> ReplicaState rState = r.rInfo.getOriginalReplicaState();
> if(rState == bestState) {
>   minLength = Math.min(minLength, r.rInfo.getNumBytes());
>   participatingList.add(r);
> }
>   }
>   newBlock.setNumBytes(minLength);
>   break;
> …
> }
> …
> nn.commitBlockSynchronization(block,
> newBlock.getGenerationStamp(), newBlock.getNumBytes(), true, false,
> datanodes, storages);
>   }
> {code}
> This code is called by the DN coordinating the block recovery. In the above 
> case, it is possible for none of the rState (reported by DNs with copies of 
> the replica being recovered) to match the bestState. This can either be 
> caused by faulty DN code or stale/modified/corrupted files on DN. When this 
> happens the DN will end up reporting the minLength of Long.MAX_VALUE.
> Unfortunately there is no check on the NN for replica length. See 
> FSNamesystem.java:
> {code:java}
>   void commitBlockSynchronization(ExtendedBlock oldBlock,
>   long newgenerationstamp, long newlength,
>   boolean closeFile, boolean deleteblock, DatanodeID[] newtargets,
>   String[] newtargetstorages) throws IOException {
> …
>   if (deleteblock) {
> Block blockToDel = ExtendedBlock.getLocalBlock(oldBlock);
> boolean remove = iFile.removeLastBlock(blockToDel) != null;
> if (remove) {
>   blockManager.removeBlock(storedBlock);
> }
>   } else {
> // update last block
> if(!copyTruncate) {
>   storedBlock.setGenerationStamp(newgenerationstamp);
>   
>   // XXX block length is updated without any check <<<
>   storedBlock.setNumBytes(newlength);
> }
> …
> if (closeFile) {
>   LOG.info("commitBlockSynchronization(oldBlock=" + oldBlock
>   + ", file=" + src
>   + (copyTruncate ? ", newBlock=" + truncatedBlock
>   : ", newgenerationstamp=" + newgenerationstamp)
>   + ", newlength=" + newlength
>   + ", newtargets=" + Arrays.asList(newtargets) + ") successful");
> } else {
>   LOG.info("commitBlockSynchronization(" + oldBlock + ") successful");
> }
>   }
> {code}
> After this point the block length becomes Long.MAX_VALUE. Any subsequent 
> block report (even with correct length) will cause the block to be marked as 
> corrupted. Since this block could be the last block of the file, if this 
> happens and the client goes away, NN won’t be able to recover the lease and 
> close the file because the last block is under-replicated.
> I believe we need to have a sanity check for block size on both DN and NN to 
> prevent such a case from happening.





[jira] [Commented] (HDFS-9148) Incorrect assert message in TestWriteToReplica#testWriteToTemporary

2015-09-28 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933910#comment-14933910
 ] 

Tony Wu commented on HDFS-9148:
---

I don't think the failed tests (TestDirectoryScanner, TestWebHDFS, 
TestWebHDFSOAuth2) have anything to do with this patch.

> Incorrect assert message in TestWriteToReplica#testWriteToTemporary
> ---
>
> Key: HDFS-9148
> URL: https://issues.apache.org/jira/browse/HDFS-9148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: hdfs-9148.patch
>
>
> The following assert text in TestWriteToReplica#testWriteToTemporary is not 
> correct:
> {code:java}
>   Assert.fail("createRbw() Should have removed the block with the older "
>   + "genstamp and replaced it with the newer one: " + 
> blocks[NON_EXISTENT]);
> {code}
> If the assert is triggered, it can only be because a temporary replica 
> already exists and has a newer generation stamp. It should have nothing to do 
> with createRbw().





[jira] [Updated] (HDFS-9148) Incorrect assert message in TestWriteToReplica#testWriteToTemporary

2015-09-28 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9148:
--
Status: Patch Available  (was: Open)

> Incorrect assert message in TestWriteToReplica#testWriteToTemporary
> ---
>
> Key: HDFS-9148
> URL: https://issues.apache.org/jira/browse/HDFS-9148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: hdfs-9148.patch
>
>
> The following assert text in TestWriteToReplica#testWriteToTemporary is not 
> correct:
> {code:java}
>   Assert.fail("createRbw() Should have removed the block with the older "
>   + "genstamp and replaced it with the newer one: " + 
> blocks[NON_EXISTENT]);
> {code}
> If the assert is triggered, it can only be because a temporary replica 
> already exists and has a newer generation stamp. It should have nothing to do 
> with createRbw().





[jira] [Commented] (HDFS-9148) Incorrect assert message in TestWriteToReplica#testWriteToTemporary

2015-09-28 Thread Tony Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933386#comment-14933386
 ] 

Tony Wu commented on HDFS-9148:
---

Thank you, Daniel, for looking at it quickly!

> Incorrect assert message in TestWriteToReplica#testWriteToTemporary
> ---
>
> Key: HDFS-9148
> URL: https://issues.apache.org/jira/browse/HDFS-9148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: hdfs-9148.patch
>
>
> The following assert text in TestWriteToReplica#testWriteToTemporary is not 
> correct:
> {code:java}
>   Assert.fail("createRbw() Should have removed the block with the older "
>   + "genstamp and replaced it with the newer one: " + 
> blocks[NON_EXISTENT]);
> {code}
> If the assert is triggered, it can only be because a temporary replica 
> already exists and has a newer generation stamp. It should have nothing to do 
> with createRbw().





[jira] [Updated] (HDFS-9148) Incorrect assert message in TestWriteToReplica#testWriteToTemporary

2015-09-25 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9148:
--
Attachment: hdfs-9148.patch

A pretty trivial change to assert text.

> Incorrect assert message in TestWriteToReplica#testWriteToTemporary
> ---
>
> Key: HDFS-9148
> URL: https://issues.apache.org/jira/browse/HDFS-9148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: hdfs-9148.patch
>
>
> The following assert text in TestWriteToReplica#testWriteToTemporary is not 
> correct:
> {code:java}
>   Assert.fail("createRbw() Should have removed the block with the older "
>   + "genstamp and replaced it with the newer one: " + 
> blocks[NON_EXISTENT]);
> {code}
> If the assert is triggered, it can only be because a temporary replica 
> already exists and has a newer generation stamp. It should have nothing to do 
> with createRbw().





[jira] [Updated] (HDFS-9148) Incorrect assert message in TestWriteToReplica#testWriteToTemporary

2015-09-25 Thread Tony Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Wu updated HDFS-9148:
--
Attachment: (was: hdfs-9148.patch)

> Incorrect assert message in TestWriteToReplica#testWriteToTemporary
> ---
>
> Key: HDFS-9148
> URL: https://issues.apache.org/jira/browse/HDFS-9148
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Trivial
> Attachments: hdfs-9148.patch
>
>
> The following assert text in TestWriteToReplica#testWriteToTemporary is not 
> correct:
> {code:java}
>   Assert.fail("createRbw() Should have removed the block with the older "
>   + "genstamp and replaced it with the newer one: " + 
> blocks[NON_EXISTENT]);
> {code}
> If the assert is triggered, it can only be because a temporary replica 
> already exists and has a newer generation stamp. It should have nothing to do 
> with createRbw().




