[jira] [Updated] (HDFS-7389) Named user ACL cannot stop the user from accessing the FS entity.

2014-11-10 Thread Vinayakumar B (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinayakumar B updated HDFS-7389:

Assignee: Vinayakumar B
Target Version/s: 2.7.0
  Status: Patch Available  (was: Open)

> Named user ACL cannot stop the user from accessing the FS entity.
> -
>
> Key: HDFS-7389
> URL: https://issues.apache.org/jira/browse/HDFS-7389
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.1
>Reporter: Chunjun Xiao
>Assignee: Vinayakumar B
> Attachments: HDFS-7389-001.patch
>
>
> In 
> http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/:
> {quote}
> It’s important to keep in mind the order of evaluation for ACL entries when a 
> user attempts to access a file system object:
> 1. If the user is the file owner, then the owner permission bits are enforced.
> 2. Else if the user has a named user ACL entry, then those permissions are 
> enforced.
> 3. Else if the user is a member of the file’s group or any named group in an 
> ACL entry, then the union of permissions for all matching entries are 
> enforced.  (The user may be a member of multiple groups.)
> 4. If none of the above were applicable, then the other permission bits are 
> enforced.
> {quote}
> Assume we have a user UserA from group GroupA. If we configure a directory 
> with the following ACL entries:
> group:GroupA:rwx
> user:UserA:---
> According to the design spec above, UserA should have no access permission to 
> the file object, but in practice UserA still has rwx access to the directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7389) Named user ACL cannot stop the user from accessing the FS entity.

2014-11-10 Thread Vinayakumar B (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinayakumar B updated HDFS-7389:

Attachment: HDFS-7389-001.patch

Thanks [~chunjun.xiao] for reporting the issue.

Here is a patch for the same.
Hi [~cnauroth] and [~wheat9], can you take a look at the patch? Thanks.

> Named user ACL cannot stop the user from accessing the FS entity.
> -
>
> Key: HDFS-7389
> URL: https://issues.apache.org/jira/browse/HDFS-7389
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.1
>Reporter: Chunjun Xiao
> Attachments: HDFS-7389-001.patch
>
>
> In 
> http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/:
> {quote}
> It’s important to keep in mind the order of evaluation for ACL entries when a 
> user attempts to access a file system object:
> 1. If the user is the file owner, then the owner permission bits are enforced.
> 2. Else if the user has a named user ACL entry, then those permissions are 
> enforced.
> 3. Else if the user is a member of the file’s group or any named group in an 
> ACL entry, then the union of permissions for all matching entries are 
> enforced.  (The user may be a member of multiple groups.)
> 4. If none of the above were applicable, then the other permission bits are 
> enforced.
> {quote}
> Assume we have a user UserA from group GroupA. If we configure a directory 
> with the following ACL entries:
> group:GroupA:rwx
> user:UserA:---
> According to the design spec above, UserA should have no access permission to 
> the file object, but in practice UserA still has rwx access to the directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206049#comment-14206049
 ] 

stack commented on HDFS-7358:
-

bq. So the synchronization is required.

What if Packet#writeTo did the buffer release?

bq. We already have close(), which is the public user API for closing the stream.

Ok.

Here are a few numbers:

With the feature turned off:

||threads||seconds||ops/second||
|10|133.662|7481.558|
|10|133.599|7485.086|
|10|134.046|7460.125|
|20|140.972|14187.215|
|20|141.949|14089.566|
|20|140.861|14198.395|
|100|153.941|64959.953|
|100|153.751|65040.223|
|100|153.372|65200.953|

With a version of the patch that does NOT synchronize Packet:

||threads||seconds||ops/second||
|10|126.219|7922.737|
|10|127.792|7825.216|
|10|124.829|8010.959|
|20|138.132|14478.904|
|20|137.051|14593.108|
|20|139.604|14326.236|
|100|149.311|66974.297|
|100|149.849|66733.844|
|100|149.537|66873.078|

Here are the latest patch numbers:

||threads||seconds||ops/second||
|10|127.079|7869.121|
|10|128.357|7790.771|
|10|129.122|7744.614|
|20|135.525|14757.426|
|20|139.531|14333.731|
|20|135.595|14749.807|
|100|149.802|66754.781|
|100|149.262|66996.289|
|100|149.925|66700.016|

Threads above are client threads. The actual number of writing and syncing 
threads stays constant at 1 and 5. More threads just means more writing per 
second.

Comparing the last run of 100 threads with the feature off against the last run 
of the latest patch, I see more stalls and about the same instructions per 
cycle, but fewer cycles overall, so the patched run comes out a bit ahead.

Perf summary on unpatched run:
{code}
 Performance counter stats for '/home/stack/hbase/bin/hbase --config 
/home/stack/conf_hbase 
org.apache.hadoop.hbase.regionserver.wal.HLogPerformanceEvaluation -threads 100 
-iterations 10 -keySize 50 -valueSize 100':

    587172.254075 task-clock                #    3.666 CPUs utilized
       18,700,961 context-switches          #    0.032 M/sec
        4,596,456 CPU-migrations            #    0.008 M/sec
          650,547 page-faults               #    0.001 M/sec
  891,035,644,874 cycles                    #    1.518 GHz                     [83.31%]
  674,789,502,548 stalled-cycles-frontend   #   75.73% frontend cycles idle    [83.32%]
  400,621,650,589 stalled-cycles-backend    #   44.96% backend  cycles idle    [66.74%]
  422,912,592,386 instructions              #    0.47  insns per cycle
                                            #    1.60  stalled cycles per insn [83.41%]
   78,498,471,337 branches                  #  133.689 M/sec                   [83.37%]
    2,768,724,048 branch-misses             #    3.53% of all branches         [83.26%]

    160.168742742 seconds time elapsed
{code}

Here is the patched version's perf output.
{code}
 Performance counter stats for '/home/stack/hbase/bin/hbase --config 
/home/stack/conf_hbase 
org.apache.hadoop.hbase.regionserver.wal.HLogPerformanceEvaluation -threads 100 
-iterations 10 -keySize 50 -valueSize 100':

    556038.390042 task-clock                #    3.550 CPUs utilized
       18,699,748 context-switches          #    0.034 M/sec
        4,534,830 CPU-migrations            #    0.008 M/sec
          636,724 page-faults               #    0.001 M/sec
  843,860,285,154 cycles                    #    1.518 GHz                     [83.29%]
  642,851,753,015 stalled-cycles-frontend   #   76.18% frontend cycles idle    [83.34%]
  384,260,620,446 stalled-cycles-backend    #   45.54% backend  cycles idle    [66.66%]
  392,462,867,299 instructions              #    0.47  insns per cycle
                                            #    1.64  stalled cycles per insn [83.36%]
   71,358,339,182 branches                  #  128.333 M/sec                   [83.43%]
    2,712,426,902 branch-misses             #    3.80% of all branches         [83.29%]

    156.646653202 seconds time elapsed
{code}



> Clients may get stuck waiting when using ByteArrayManager
> -
>
> Key: HDFS-7358
> URL: https://issues.apache.org/jira/browse/HDFS-7358
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
> h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
> h7358_20141108.patch
>
>
> [~stack] reported that clients might get stuck waiting when using 
> ByteArrayManager; see [his 
> comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205997#comment-14205997
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680729/HDFS-7314-7.patch
  against trunk revision 58e9bf4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8711//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8711//console

This message is automatically generated.

> Aborted DFSClient's impact on long running service like YARN
> 
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314.patch
>
>
> It happened in a YARN nodemanager scenario, but it could happen to any long 
> running service that uses a cached instance of DistributedFileSystem.
> 1. Active NN is under heavy load. So it became unavailable for 10 minutes; 
> any DFSClient request will get ConnectTimeoutException.
> 2. YARN nodemanager use DFSClient for certain write operation such as log 
> aggregator or shared cache in YARN-1492. DFSClient used by YARN NM's 
> renewLease RPC got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects an aborted DFSClient and creates a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.

[jira] [Commented] (HDFS-7345) Local Reconstruction Codes (LRC)

2014-11-10 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205962#comment-14205962
 ] 

Kai Zheng commented on HDFS-7345:
-

From [Facebook’s advanced erasure codes|http://storagemojo.com/2013/06/21/facebooks-advanced-erasure-codes/]:
The LRC tests produced several key results:
* Disk I/O and network traffic were reduced by half compared to RS codes.
* The LRC required 14% more storage than RS, which is information-theoretically 
optimal for the obtained locality.
* Repair times were much lower thanks to the local repair codes.
* Much greater reliability thanks to fast repairs.
* Reduced network traffic makes them suitable for geographic distribution.

So LRC looks quite appealing for HDFS. I'm wondering whether there is any IP 
concern if we adopt it. The concern exists because LRC comes from MS Research, 
and I haven't yet received any confirmation that it is available to the community.

Could anyone help confirm whether there is an IP concern around LRC?

> Local Reconstruction Codes (LRC)
> 
>
> Key: HDFS-7345
> URL: https://issues.apache.org/jira/browse/HDFS-7345
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Kai Zheng
>Assignee: Kai Zheng
>
> HDFS-7285 proposes to support Erasure Coding inside HDFS, supporting multiple 
> Erasure Coding codecs via a pluggable framework and implementing Reed-Solomon 
> code by default. This is to support a more advanced coding mechanism, Local 
> Reconstruction Codes (LRC). As discussed in the paper 
> (https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf), 
> LRC reduces the number of erasure coding fragments that need to be read when 
> reconstructing data fragments that are offline, while still keeping the 
> storage overhead low. The important benefits of LRC are that it reduces the 
> bandwidth and I/Os required for repair reads over prior codes, while still 
> allowing a significant reduction in storage overhead. The Intel ISA library 
> also supports LRC in its update and can also be leveraged. The implementation 
> would also consider how to distribute the calculation of local and global 
> parity blocks to other relevant DataNodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7375) Move FSClusterStats to o.a.h.h.hdfs.server.blockmanagement

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205901#comment-14205901
 ] 

Hadoop QA commented on HDFS-7375:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680034/HDFS-7375.001.patch
  against trunk revision 2cc868d.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestLeaseRecovery2
  org.apache.hadoop.hdfs.server.balancer.TestBalancer

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestFileAppend2

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8709//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8709//console

This message is automatically generated.

> Move FSClusterStats to o.a.h.h.hdfs.server.blockmanagement
> --
>
> Key: HDFS-7375
> URL: https://issues.apache.org/jira/browse/HDFS-7375
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haohui Mai
>Assignee: Haohui Mai
> Attachments: HDFS-7375.000.patch, HDFS-7375.001.patch
>
>
> {{FSClusterStats}} is a private class that exports statistics for 
> {{BlockPlacementPolicy}}. This jira proposes moving it to 
> {{o.a.h.h.hdfs.server.blockmanagement}} to simplify the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6757) Simplify lease manager with INodeID

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205902#comment-14205902
 ] 

Hadoop QA commented on HDFS-6757:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12679938/HDFS-6757.008.patch
  against trunk revision 2cc868d.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestFileAppend2

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8710//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8710//console

This message is automatically generated.

> Simplify lease manager with INodeID
> ---
>
> Key: HDFS-6757
> URL: https://issues.apache.org/jira/browse/HDFS-6757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Haohui Mai
> Attachments: HDFS-6757.000.patch, HDFS-6757.001.patch, 
> HDFS-6757.002.patch, HDFS-6757.003.patch, HDFS-6757.004.patch, 
> HDFS-6757.005.patch, HDFS-6757.006.patch, HDFS-6757.007.patch, 
> HDFS-6757.008.patch
>
>
> Currently the lease manager records leases based on path instead of inode 
> ids. Therefore, the lease manager needs to carefully keep track of the path 
> of active leases during renames and deletes. This can be a non-trivial task.
> This jira proposes to simplify the logic by tracking leases using inodeids 
> instead of paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7389) Named user ACL cannot stop the user from accessing the FS entity.

2014-11-10 Thread Chunjun Xiao (JIRA)
Chunjun Xiao created HDFS-7389:
--

 Summary: Named user ACL cannot stop the user from accessing the FS 
entity.
 Key: HDFS-7389
 URL: https://issues.apache.org/jira/browse/HDFS-7389
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.5.1
Reporter: Chunjun Xiao


In 
http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/:
{quote}
It’s important to keep in mind the order of evaluation for ACL entries when a 
user attempts to access a file system object:

1. If the user is the file owner, then the owner permission bits are enforced.
2. Else if the user has a named user ACL entry, then those permissions are 
enforced.
3. Else if the user is a member of the file’s group or any named group in an 
ACL entry, then the union of permissions for all matching entries are enforced. 
 (The user may be a member of multiple groups.)
4. If none of the above were applicable, then the other permission bits are 
enforced.
{quote}

Assume we have a user UserA from group GroupA. If we configure a directory with 
the following ACL entries:
group:GroupA:rwx
user:UserA:---

According to the design spec above, UserA should have no access permission to 
the file object, but in practice UserA still has rwx access to the directory.
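For illustration, here is a minimal, self-contained sketch of the reported setup 
using the public FileSystem ACL API (the class name and the /tmp/acl-repro path 
are just examples, and it assumes a reachable HDFS configuration on the classpath):

{code}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class NamedUserAclRepro {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/tmp/acl-repro");   // example path, not from the report

    fs.mkdirs(dir);
    // The two entries from the report: group:GroupA:rwx and user:UserA:---
    fs.modifyAclEntries(dir, Arrays.asList(
        new AclEntry.Builder().setScope(AclEntryScope.ACCESS)
            .setType(AclEntryType.GROUP).setName("GroupA")
            .setPermission(FsAction.ALL).build(),
        new AclEntry.Builder().setScope(AclEntryScope.ACCESS)
            .setType(AclEntryType.USER).setName("UserA")
            .setPermission(FsAction.NONE).build()));

    // Print the resulting ACL; access as UserA can then be tried with e.g.
    // "hdfs dfs -ls /tmp/acl-repro" run as that user.
    System.out.println(fs.getAclStatus(dir));
  }
}
{code}

With these entries in place, the documented evaluation order says the named-user 
entry for UserA must be matched before the GroupA entry, so UserA should be 
denied; the report is that the group entry wins instead.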




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7314:
--
Attachment: HDFS-7314-7.patch

Updated the unit test TestDistributedFileSystem, as it assumed that the same 
LeaseRenewer object would still be used after the lease renewal thread expires; 
that assumption came from the test calling {{getLeaseRenewer()}} after the 
stream is closed.

Given that {{getLeaseRenewer()}} no longer calls addClient, the {{LeaseRenewer}} 
object will be released as part of the lease renewal thread's expiration. Thus 
the test needs to set the grace period value on the new object.

> Aborted DFSClient's impact on long running service like YARN
> 
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314.patch
>
>
> It happened in a YARN nodemanager scenario, but it could happen to any long 
> running service that uses a cached instance of DistributedFileSystem.
> 1. Active NN is under heavy load. So it became unavailable for 10 minutes; 
> any DFSClient request will get ConnectTimeoutException.
> 2. YARN nodemanager use DFSClient for certain write operation such as log 
> aggregator or shared cache in YARN-1492. DFSClient used by YARN NM's 
> renewLease RPC got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects an aborted DFSClient and creates a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.
> * After DFSClient gets into Aborted state, it doesn't have to reject all 
> requests , instead it can retry. If NN is available again it can transition 
> to healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-10 Thread jiangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205843#comment-14205843
 ] 

jiangyu commented on HDFS-7385:
---

You can also use OfflineEditsViewer to find this bug easily.

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original ActiveNameNode to the new 
> ActiveNameNode and started the new NameNode. Then we changed the configuration 
> of all datanodes and restarted them, so they block-reported to the new 
> NameNodes at once and sent heartbeats after that.
>   Everything seemed perfect, but after we restarted the ResourceManager, most 
> of the users complained that their jobs couldn't be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most of 
> the directories and files which had no ACLs set before now carried ACL 
> entries. That is why users could not execute their jobs, so we had to change 
> most file permissions to a+r and directory permissions to a+rx to make sure 
> the jobs could be executed.
> After investigating this problem for some days, I found there is a bug in 
> FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
> PermissionStatus permissions = newNode.getPermissionStatus();
> MkdirOp op = MkdirOp.getInstance(cache.get())
>   .setInodeId(newNode.getId())
>   .setPath(path)
>   .setTimestamp(newNode.getModificationTime())
>   .setPermissionStatus(permissions);
> AclFeature f = newNode.getAclFeature();
> if (f != null) {
>   op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
> }
> logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in fact), 
> we set the AclEntries on the op from the cache. After that, if we mkdir 
> without any ACLs through the same handler, the AclEntries from the cache are 
> still the ones from the previous call that set ACLs, and because the newNode 
> has no AclFeature, we never get a chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby loads the editlogs from 
> the journalnodes, applies them to memory, saves the namespace, and transfers 
> the wrong fsimage to the ANN, all the fsimages become wrong. The only solution 
> is to save the namespace from the ANN; then you get the right fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-10 Thread jiangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205822#comment-14205822
 ] 

jiangyu commented on HDFS-7385:
---

That is right, Colin. I think we should set the ACL entries to null if there are 
none. For now, this bug has messed up almost all the permissions in our cluster. 
It is easy to reproduce: just set some ACL entries and mkdir randomly a few 
times; then, after you restart the NameNode or transition the SNN to ANN, you 
can easily find directories whose permissions are not what you expected. I 
wonder whether any other company is using the ACL feature?
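To make the failure mode concrete, here is a small, self-contained illustration 
of the pitfall (stand-in classes, not the real FSEditLog/MkdirOp types): a 
thread-local, reused op keeps whatever ACL entries the previous call set unless 
the caller explicitly resets them, which is the null reset suggested above.

{code}
import java.util.Collections;
import java.util.List;

public class ReusedOpSketch {
  // Stand-in for the thread-locally cached MkdirOp.
  static class MkdirOpLike {
    List<String> aclEntries;                        // stays stale unless reset
    MkdirOpLike setAclEntries(List<String> acls) { aclEntries = acls; return this; }
  }

  static final ThreadLocal<MkdirOpLike> CACHE =
      ThreadLocal.withInitial(MkdirOpLike::new);

  // Stand-in for logMkDir(): aclsOrNull == null plays the role of "no AclFeature".
  static MkdirOpLike logMkdirLike(List<String> aclsOrNull) {
    MkdirOpLike op = CACHE.get();
    if (aclsOrNull != null) {
      op.setAclEntries(aclsOrNull);
    } else {
      // The reset discussed above; omit this branch and the previous call's
      // ACL entries are silently logged again for an ACL-less directory.
      op.setAclEntries(null);
    }
    return op;
  }

  public static void main(String[] args) {
    logMkdirLike(Collections.singletonList("user:UserA:rwx"));
    // With the reset in place this prints null; without it, it would print
    // the stale [user:UserA:rwx] entry.
    System.out.println(logMkdirLike(null).aclEntries);
  }
}
{code}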

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original ActiveNameNode to the new 
> ActiveNameNode and started the new NameNode. Then we changed the configuration 
> of all datanodes and restarted them, so they block-reported to the new 
> NameNodes at once and sent heartbeats after that.
>   Everything seemed perfect, but after we restarted the ResourceManager, most 
> of the users complained that their jobs couldn't be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most of 
> the directories and files which had no ACLs set before now carried ACL 
> entries. That is why users could not execute their jobs, so we had to change 
> most file permissions to a+r and directory permissions to a+rx to make sure 
> the jobs could be executed.
> After investigating this problem for some days, I found there is a bug in 
> FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the 
> proper value in the logMkdir and logOpenFile functions. Here is the code of 
> logMkdir:
>   public void logMkDir(String path, INode newNode) {
> PermissionStatus permissions = newNode.getPermissionStatus();
> MkdirOp op = MkdirOp.getInstance(cache.get())
>   .setInodeId(newNode.getId())
>   .setPath(path)
>   .setTimestamp(newNode.getModificationTime())
>   .setPermissionStatus(permissions);
> AclFeature f = newNode.getAclFeature();
> if (f != null) {
>   op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
> }
> logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in fact), 
> we set the AclEntries on the op from the cache. After that, if we mkdir 
> without any ACLs through the same handler, the AclEntries from the cache are 
> still the ones from the previous call that set ACLs, and because the newNode 
> has no AclFeature, we never get a chance to clear them. The editlog is then 
> wrong and records the wrong ACLs. After the Standby loads the editlogs from 
> the journalnodes, applies them to memory, saves the namespace, and transfers 
> the wrong fsimage to the ANN, all the fsimages become wrong. The only solution 
> is to save the namespace from the ANN; then you get the right fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-10 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205737#comment-14205737
 ] 

Jing Zhao commented on HDFS-7387:
-

Thanks for the fix, Brandon! The patch looks good to me. A few minor comments on 
the test:
# This line needs to be cleaned up:
{code}
+//Mockito.when(fos.getPos()).thenReturn((long) 6);
{code}
# It would be helpful to have some javadoc in the test explaining what scenarios 
are covered.

+1 after addressing the comments.
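As an aside, the race in the description below boils down to the check sketched 
here (a generic illustration with made-up names, not the actual NFS gateway 
code): an empty pending-write queue alone does not prove that the data up to the 
COMMIT offset has been flushed.

{code}
public class CommitCheckSketch {
  // Generic illustration only; the real field and method names in the NFS
  // gateway (OpenFileCtx etc.) differ.
  static boolean commitSatisfied(long flushedOffset, long commitOffset,
                                 boolean pendingWriteQueueEmpty) {
    // pendingWriteQueueEmpty is deliberately ignored: the last write may have
    // been dequeued but only partially flushed, so only the flushed offset
    // can answer whether the COMMIT range is durable.
    return flushedOffset >= commitOffset;
  }

  public static void main(String[] args) {
    // Queue already empty, but only 4 KB of an 8 KB final write flushed:
    System.out.println(commitSatisfied(4096, 8192, true));   // false => must wait
  }
}
{code}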

> NFS may only do partial commit due to a race between COMMIT and write
> -
>
> Key: HDFS-7387
> URL: https://issues.apache.org/jira/browse/HDFS-7387
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Critical
> Attachments: HDFS-7387.001.patch
>
>
> The requested range may not be committed when the following happens:
> 1. the last pending write is removed from the queue to write to hdfs
> 2. a commit request arrives, NFS sees there is no pending write, and it will 
> do a sync
> 3. this sync request could flush only part of the last write to hdfs
> 4. if a file read happens immediately after the above steps, the user may not 
> see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-10 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205685#comment-14205685
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7358:
---

> We need all this new synchronization on Packet? ...

TestHFlush calls write(..), interrupts the thread and then calls close().  
write(..) simply puts data into the packet queue, and DataStreamer, a separate 
thread, takes the packets from the queue and writes them to the socket.  Now, we 
set buf to null during close, so DataStreamer may get an NPE when accessing the 
data.  So the synchronization is required.

> nit: rename setClosed to close if you end up making a new patch.

We already have close(), which is the public user API for closing the stream.
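To illustrate the point, here is a self-contained sketch (stand-in names, not 
the actual DFSOutputStream code) of why the send path and the buffer release 
need to share a lock: without it, the streamer thread can observe the buffer as 
null after close() and hit the NPE.

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class PacketSketch {
  private byte[] buf = new byte[64 * 1024];

  // Called from the DataStreamer-like thread when sending the packet.
  synchronized void writeTo(OutputStream out) throws IOException {
    if (buf == null) {
      return;               // already released; nothing left to send
    }
    out.write(buf, 0, buf.length);
  }

  // Called from the closing thread: hand the array back (e.g. to a byte-array
  // pool) by dropping the reference.
  synchronized void releaseBuffer() {
    buf = null;
  }

  public static void main(String[] args) throws Exception {
    PacketSketch p = new PacketSketch();
    Thread streamer = new Thread(() -> {
      try {
        p.writeTo(new ByteArrayOutputStream());
      } catch (IOException e) {
        e.printStackTrace();
      }
    });
    streamer.start();
    p.releaseBuffer();      // without the shared lock this could race into an NPE
    streamer.join();
  }
}
{code}

The alternative floated above, having writeTo itself hand the buffer back once 
the bytes are written, would keep the release on the streamer thread and avoid 
the cross-thread null entirely.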

> Clients may get stuck waiting when using ByteArrayManager
> -
>
> Key: HDFS-7358
> URL: https://issues.apache.org/jira/browse/HDFS-7358
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
> h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
> h7358_20141108.patch
>
>
> [~stack] reported that clients might get stuck waiting when using 
> ByteArrayManager; see [his 
> comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop

2014-11-10 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205669#comment-14205669
 ] 

Ravi Prakash commented on HDFS-4882:


Thanks for your review, Colin! Your understanding is correct. In this case, for 
a very strange reason I have not yet been able to uncover, the FSNamesystem 
wasn't able to recover the lease. I am investigating this root issue in 
HDFS-7342. In the meantime, however, I'd argue that the Namenode should never 
enter an infinite loop for any reason; instead of assuming we have fixed all 
possible reasons why a lease couldn't be recovered, we should relinquish the 
lock regularly. We should also display on the web UI how many files are open 
for writing and allow ops to forcibly close open files (HDFS-7307). The way 
this error manifests (the NN suddenly stops working) is egregious.

sortedLeases is also used externally, in FSNamesystem.getCompleteBlocksTotal(), 
and we were actively modifying it in checkLeases. I'm sure we can move things 
around to keep using SortedSets, but I don't know whether this collection will 
ever really become big enough for the performance difference to matter. What do 
you think?
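Here is a self-contained sketch of the "relinquish the lock regularly" idea (not 
the actual LeaseManager code; the 500 ms bound and the names are assumptions): 
cap the work done while holding the lock and stop when no progress is made, so 
an unrecoverable lease can never pin the monitor thread.

{code}
import java.util.TreeSet;

public class LeaseCheckSketch {
  private static final long MAX_LOCK_HOLD_NANOS = 500_000_000L;  // assumed 500 ms cap

  private final TreeSet<Long> sortedLeases = new TreeSet<>();
  private final Object fsLock = new Object();

  void checkLeases() {
    synchronized (fsLock) {
      long start = System.nanoTime();
      while (!sortedLeases.isEmpty()
          && System.nanoTime() - start < MAX_LOCK_HOLD_NANOS) {
        Long oldest = sortedLeases.first();
        if (tryRecover(oldest)) {
          sortedLeases.remove(oldest);
        } else {
          break;   // no progress: give the lock back instead of spinning forever
        }
      }
    }
  }

  // Stand-in for internalReleaseLease(); the real call can legitimately fail.
  private boolean tryRecover(Long lease) {
    return false;
  }

  public static void main(String[] args) {
    LeaseCheckSketch sketch = new LeaseCheckSketch();
    sketch.sortedLeases.add(42L);   // a lease that can never be recovered
    sketch.checkLeases();           // returns promptly instead of looping forever
    System.out.println("done");
  }
}
{code}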

> Namenode LeaseManager checkLeases() runs into infinite loop
> ---
>
> Key: HDFS-4882
> URL: https://issues.apache.org/jira/browse/HDFS-4882
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client, namenode
>Affects Versions: 2.0.0-alpha, 2.5.1
>Reporter: Zesheng Wu
>Assignee: Ravi Prakash
>Priority: Critical
> Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
> HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of the first block are successfully acked and the client 
> sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 are down
> 6. the client recovers the pipeline, but no new DN is added to the pipeline 
> because the current pipeline stage is PIPELINE_CLOSE
> 7. the client continues writing the last block and tries to close the file after 
> writing all the data
> 8. NN finds that the penultimate block doesn't have enough replicas (our 
> dfs.namenode.replication.min=2), and the client's close runs into an indefinite 
> loop (HDFS-2936); at the same time, NN sets the last block's state to 
> COMPLETE
> 9. shut down the client
> 10. the file's lease exceeds the hard limit
> 11. LeaseManager realizes that and begins lease recovery by calling 
> fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers lease manager's 
> infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
>  limit
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
>  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
> /user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
> blk_-7028017402720175688_1202597,
> lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
> for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
> APREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd line log is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205634#comment-14205634
 ] 

Hadoop QA commented on HDFS-7387:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680689/HDFS-7387.001.patch
  against trunk revision 68a0508.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs-nfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8708//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8708//console

This message is automatically generated.

> NFS may only do partial commit due to a race between COMMIT and write
> -
>
> Key: HDFS-7387
> URL: https://issues.apache.org/jira/browse/HDFS-7387
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Critical
> Attachments: HDFS-7387.001.patch
>
>
> The requested range may not be committed when the following happens:
> 1. the last pending write is removed from the queue to write to hdfs
> 2. a commit request arrives, NFS sees there is no pending write, and it will 
> do a sync
> 3. this sync request could flush only part of the last write to hdfs
> 4. if a file read happens immediately after the above steps, the user may not 
> see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7056) Snapshot support for truncate

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205630#comment-14205630
 ] 

Hadoop QA commented on HDFS-7056:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12680652/HDFS-3107-HDFS-7056-combined.patch
  against trunk revision eace218.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.TestCommitBlockSynchronization
  
org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
  org.apache.hadoop.hdfs.TestFileCreation

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8706//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8706//console

This message is automatically generated.

> Snapshot support for truncate
> -
>
> Key: HDFS-7056
> URL: https://issues.apache.org/jira/browse/HDFS-7056
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107-HDFS-7056-combined.patch, 
> HDFS-3107-HDFS-7056-combined.patch, HDFS-7056.patch, HDFS-7056.patch, 
> HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx
>
>
> The implementation of truncate in HDFS-3107 does not allow truncating files 
> which are in a snapshot. It is desirable to be able to truncate and still keep 
> the old state of the file in the snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205631#comment-14205631
 ] 

Hadoop QA commented on HDFS-7314:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680685/HDFS-7314-6.patch
  against trunk revision 68a0508.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestDistributedFileSystem

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8707//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8707//console

This message is automatically generated.

> Aborted DFSClient's impact on long running service like YARN
> 
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314.patch
>
>
> It happened in a YARN nodemanager scenario, but it could happen to any long 
> running service that uses a cached instance of DistributedFileSystem.
> 1. Active NN is under heavy load. So it became unavailable for 10 minutes; 
> any DFSClient request will get ConnectTimeoutException.
> 2. YARN nodemanager use DFSClient for certain write operation such as log 
> aggregator or shared cache in YARN-1492. DFSClient used by YARN NM's 
> renewLease RPC got ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects an aborted DFSClient and creates a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.

[jira] [Updated] (HDFS-7274) Disable SSLv3 in HttpFS

2014-11-10 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated HDFS-7274:
---
Target Version/s:   (was: 2.6.0)
   Fix Version/s: (was: 2.6.0)
  2.5.2

Included this in 2.5.2 as well. 

> Disable SSLv3 in HttpFS
> ---
>
> Key: HDFS-7274
> URL: https://issues.apache.org/jira/browse/HDFS-7274
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Blocker
> Fix For: 2.5.2
>
> Attachments: HDFS-7274.patch, HDFS-7274.patch
>
>
> We should disable SSLv3 in HttpFS to protect against the POODLEbleed 
> vulnerability.
> See 
> [CVE-2014-3566|http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566]
> We have {{sslProtocol="TLS"}} set to only allow TLS in ssl-server.xml, but 
> when I checked, I could still connect with SSLv3.  The documentation of the 
> tomcat configs is somewhat unclear about the difference between {{sslProtocol}}, 
> {{sslProtocols}}, and {{sslEnabledProtocols}} and what exactly each value they 
> take does.  From what I can gather, {{sslProtocol="TLS"}} actually 
> includes SSLv3, and the only way to fix this is to explicitly list which TLS 
> versions we support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-10 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-7387:
-
Attachment: HDFS-7387.001.patch

> NFS may only do partial commit due to a race between COMMIT and write
> -
>
> Key: HDFS-7387
> URL: https://issues.apache.org/jira/browse/HDFS-7387
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Critical
> Attachments: HDFS-7387.001.patch
>
>
> The requested range may not be committed when the following happens:
> 1. the last pending write is removed from the queue to write to hdfs
> 2. a commit request arrives, NFS sees there is no pending write, and it will 
> do a sync
> 3. this sync request could flush only part of the last write to hdfs
> 4. if a file read happens immediately after the above steps, the user may not 
> see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-10 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-7387:
-
Status: Patch Available  (was: Open)

> NFS may only do partial commit due to a race between COMMIT and write
> -
>
> Key: HDFS-7387
> URL: https://issues.apache.org/jira/browse/HDFS-7387
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Critical
> Attachments: HDFS-7387.001.patch
>
>
> The requested range may not be committed when the following happens:
> 1. the last pending write is removed from the queue to write to hdfs
> 2. a commit request arrives, NFS sees there is no pending write, and it will 
> do a sync
> 3. this sync request could flush only part of the last write to hdfs
> 4. if a file read happens immediately after the above steps, the user may not 
> see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7388) Improve the log for HA failover

2014-11-10 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-7388:
---

 Summary: Improve the log for HA failover
 Key: HDFS-7388
 URL: https://issues.apache.org/jira/browse/HDFS-7388
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha
Reporter: Jing Zhao


Currently, to debug an issue during an NN HA failover (with automatic failover 
set up), we usually need to check the logs from all the NN and ZKFC daemons to 
figure out what triggered the failover. Possible reasons include NN health 
issues, connection issues between the ZKFC and the NN, connection issues between 
the ZKFC and ZK, and a manual op started by an admin. It would be helpful to 
improve the logging in the NN and ZKFC and add more information to make 
debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7388) Improve the log for HA failover

2014-11-10 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao reassigned HDFS-7388:
---

Assignee: Jing Zhao

> Improve the log for HA failover
> ---
>
> Key: HDFS-7388
> URL: https://issues.apache.org/jira/browse/HDFS-7388
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha
>Reporter: Jing Zhao
>Assignee: Jing Zhao
>
> Currently, to debug an issue during an NN HA failover (with automatic 
> failover set up), we usually need to check the logs from all the NN and ZKFC 
> daemons to figure out what triggered the failover. Possible reasons include 
> NN health issues, connection issues between the ZKFC and the NN, connection 
> issues between the ZKFC and ZK, and a manual op started by an admin. It would 
> be helpful to improve the logging in the NN and ZKFC and add more information 
> to make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-10 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-7387:
-
Affects Version/s: 2.6.0

> NFS may only do partial commit due to a race between COMMIT and write
> -
>
> Key: HDFS-7387
> URL: https://issues.apache.org/jira/browse/HDFS-7387
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Critical
>
> The requested range may not be committed when the following happens:
> 1. the last pending write is removed from the queue to write to hdfs
> 2. a commit request arrives, NFS sees there is no pending write, and it will 
> do a sync
> 3. this sync request could flush only part of the last write to hdfs
> 4. if a file read happens immediately after the above steps, the user may not 
> see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-10 Thread Brandon Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Li updated HDFS-7387:
-
Component/s: nfs

> NFS may only do partial commit due to a race between COMMIT and write
> -
>
> Key: HDFS-7387
> URL: https://issues.apache.org/jira/browse/HDFS-7387
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Brandon Li
>Assignee: Brandon Li
>Priority: Critical
>
> The requested range may not be committed when the following happens:
> 1. the last pending write is removed from the queue to write to hdfs
> 2. a commit request arrives, NFS sees there is no pending write, and it will 
> do a sync
> 3. this sync request could flush only part of the last write to hdfs
> 4. if a file read happens immediately after the above steps, the user may not 
> see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7314:
--
Attachment: HDFS-7314-6.patch

Thanks, Colin. Keeping the thread running shouldn't abort the same clients more 
than once, but I agree with you that it is better to let the thread go.

There is another race condition, between {{beginFileLease}} and the 
{{LeaseRenewer}} lease abort:

1. {{beginFileLease}} calls into {{getLeaseRenewer}}, which adds the 
{{DFSClient}} to the {{LeaseRenewer}}'s list.
2. {{LeaseRenewer}} removes all {{DFSClient}}s upon the socket timeout, 
including the {{DFSClient}} just added.
3. {{beginFileLease}} continues on to call the {{LeaseRenewer}}'s {{put}} 
method, which adds the file to the {{DFSClient}}. But since the {{DFSClient}} 
is no longer in the {{LeaseRenewer}}'s list, its lease won't be renewed.

The patch also fixes this new scenario by moving {{addClient}} into the {{put}} 
method.
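
To make the fix concrete, here is a minimal, hypothetical sketch (simplified 
names; not the real {{LeaseRenewer}}/{{DFSClient}} API) of why registering the 
client inside {{put}}, under the same lock the abort path takes, closes the 
window:
{code}
// Simplified illustration only -- not the actual Hadoop classes.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LeaseRenewerSketch {
  private final List<String> clients = new ArrayList<>();
  private final Map<String, List<String>> filesByClient = new HashMap<>();

  // Racy shape: the caller registers the client first (getLeaseRenewer), an
  // abort clears the client list, and only then does put() add the file, so
  // the file ends up tracked for a client the renewer no longer knows about.

  // Shape after the patch: the client is (re)registered inside put(), under
  // the same lock the abort path takes, so the file can never be orphaned.
  synchronized void put(String client, String file) {
    if (!clients.contains(client)) {
      clients.add(client);                        // addClient moved into put()
    }
    filesByClient.computeIfAbsent(client, c -> new ArrayList<>()).add(file);
  }

  synchronized void abortAll() {
    clients.clear();                              // abort holds the same lock
    filesByClient.clear();
  }
}
{code}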

> Aborted DFSClient's impact on long running service like YARN
> 
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314.patch
>
>
> It happened in a YARN NodeManager scenario, but it could happen to any long 
> running service that uses a cached instance of DistributedFileSystem.
> 1. Active NN is under heavy load, so it became unavailable for 10 minutes; 
> any DFSClient request will get a ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations such as 
> the log aggregator or the shared cache in YARN-1492. The DFSClient used by 
> YARN NM's renewLease RPC got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects Aborted DFSClient and create a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.
> * After DFSClient gets into Aborted state, it doesn't have to reject all 
> requests , instead it can retry. If NN is available again it can transition 
> to healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7387) NFS may only do partial commit due to a race between COMMIT and write

2014-11-10 Thread Brandon Li (JIRA)
Brandon Li created HDFS-7387:


 Summary: NFS may only do partial commit due to a race between 
COMMIT and write
 Key: HDFS-7387
 URL: https://issues.apache.org/jira/browse/HDFS-7387
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Brandon Li
Assignee: Brandon Li
Priority: Critical


The requested range may not be committed when the following happens:
1. the last pending write is removed from the queue to write to hdfs
2. a commit request arrives, NFS sees there is no pending write, and it will 
do a sync
3. this sync request could flush only part of the last write to hdfs
4. if a file read happens immediately after the above steps, the user may not 
see all the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7312) Update DistCp v1 to optionally not use tmp location

2014-11-10 Thread Joseph Prosser (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Prosser updated HDFS-7312:
-
Attachment: HDFS-7312.002.patch

This patch is for branch-1 only.

> Update DistCp v1 to optionally not use tmp location
> ---
>
> Key: HDFS-7312
> URL: https://issues.apache.org/jira/browse/HDFS-7312
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.5.1
>Reporter: Joseph Prosser
>Assignee: Joseph Prosser
>Priority: Minor
> Attachments: HDFS-7312.001.patch, HDFS-7312.002.patch, HDFS-7312.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> DistCp v1 currently copies files to a tmp location and then renames that to 
> the specified destination.  This can cause performance issues on filesystems 
> such as S3.  A -skiptmp flag will be added to bypass this step and copy 
> directly to the destination.  This feature mirrors a similar one added to 
> HBase ExportSnapshot 
> [HBASE-9|https://issues.apache.org/jira/browse/HBASE-9]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3107) HDFS truncate

2014-11-10 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated HDFS-3107:
---
Status: Open  (was: Patch Available)

> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Lei Chang
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107-HDFS-7056-combined.patch, HDFS-3107.008.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS_truncate.pdf, HDFS_truncate.pdf, HDFS_truncate.pdf, 
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
> editsStored, editsStored.xml
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard Posix operation) which is a reverse operation of 
> append, which makes upper layer applications use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7056) Snapshot support for truncate

2014-11-10 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated HDFS-7056:
---
Attachment: HDFS-3107-HDFS-7056-combined.patch

Attaching a refreshed combined patch of the latest HDFS-3107 and latest 
HDFS-7056 work.

The notable changes between the two versions of the combined patch are the 
Quota / diskSpaceConsumed() improvements and the fixes for the comments raised 
in HDFS-7056.

> Snapshot support for truncate
> -
>
> Key: HDFS-7056
> URL: https://issues.apache.org/jira/browse/HDFS-7056
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107-HDFS-7056-combined.patch, 
> HDFS-3107-HDFS-7056-combined.patch, HDFS-7056.patch, HDFS-7056.patch, 
> HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx
>
>
> Implementation of truncate in HDFS-3107 does not allow truncating files which 
> are in a snapshot. It is desirable to be able to truncate and still keep the 
> old file state of the file in the snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3107) HDFS truncate

2014-11-10 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated HDFS-3107:
---
Status: Patch Available  (was: Open)

> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Lei Chang
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107-HDFS-7056-combined.patch, HDFS-3107.008.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, 
> HDFS_truncate.pdf, HDFS_truncate.pdf, HDFS_truncate.pdf, 
> HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf, 
> editsStored, editsStored.xml
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard Posix operation) which is a reverse operation of 
> append, which makes upper layer applications use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7056) Snapshot support for truncate

2014-11-10 Thread Plamen Jeliazkov (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Plamen Jeliazkov updated HDFS-7056:
---
Status: Patch Available  (was: Open)

> Snapshot support for truncate
> -
>
> Key: HDFS-7056
> URL: https://issues.apache.org/jira/browse/HDFS-7056
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107-HDFS-7056-combined.patch, 
> HDFS-3107-HDFS-7056-combined.patch, HDFS-7056.patch, HDFS-7056.patch, 
> HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx
>
>
> Implementation of truncate in HDFS-3107 does not allow truncating files which 
> are in a snapshot. It is desirable to be able to truncate and still keep the 
> old file state of the file in the snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205270#comment-14205270
 ] 

Colin Patrick McCabe commented on HDFS-7314:


Good catch.  This code is certainly somewhat subtle.  I think that the 
{{currentId}} variable was intended to address the problem you're describing.

Keeping the thread running seems strange.  Is it going to abort the clients 
it's tracking more than once?  I would rather stop it if at all possible.

It seems like maybe what we should do here is set {{emptyTime}} to 0 and break 
out of the loop to exit the thread.  This will lead to the current 
{{LeaseRenewer}} thread being considered "expired" and not used in 
{{LeaseRenewer#put}}.  So there should be no race condition then, because 
{{LeaseRenewer#put}} will create a new thread (and increment {{currentId}}) if 
the current one is expired.
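
A hedged sketch of that idea (field and method names simplified; not the real 
{{LeaseRenewer}} internals): on abort, mark the renewer expired and let its 
thread exit, so a later {{put}} creates a fresh renewer instead.
{code}
// Illustration only: an expired renewer is replaced by a new instance, which
// also bumps the id (the role currentId plays in the real code).
import java.util.concurrent.atomic.AtomicLong;

class ExpiringRenewerSketch {
  private static final AtomicLong CURRENT_ID = new AtomicLong();
  final long id = CURRENT_ID.incrementAndGet();
  private volatile long emptyTime = Long.MAX_VALUE;  // MAX_VALUE == "in use"

  boolean isExpired(long nowMs, long gracePeriodMs) {
    return emptyTime != Long.MAX_VALUE && nowMs - emptyTime > gracePeriodMs;
  }

  void onAbort() {
    emptyTime = 0;   // immediately considered expired
    // ... the run() loop would then break so the renewer thread exits ...
  }

  static ExpiringRenewerSketch getOrCreate(ExpiringRenewerSketch current,
                                           long nowMs, long gracePeriodMs) {
    // put()-style logic: replace an expired renewer with a new instance.
    if (current == null || current.isExpired(nowMs, gracePeriodMs)) {
      return new ExpiringRenewerSketch();
    }
    return current;
  }
}
{code}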

> Aborted DFSClient's impact on long running service like YARN
> 
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314.patch
>
>
> It happened in a YARN NodeManager scenario, but it could happen to any long 
> running service that uses a cached instance of DistributedFileSystem.
> 1. Active NN is under heavy load, so it became unavailable for 10 minutes; 
> any DFSClient request will get a ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations such as 
> the log aggregator or the shared cache in YARN-1492. The DFSClient used by 
> YARN NM's renewLease RPC got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects Aborted DFSClient and create a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.
> * After DFSClient gets into Aborted state, it doesn't have to reject all 
> requests , instead it can retry. If NN is available again it can transition 
> to healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6803) Documenting DFSClient#DFSInputStream expectations reading and preading in concurrent context

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205251#comment-14205251
 ] 

Colin Patrick McCabe commented on HDFS-6803:


+1 from me.  [~stev...@iseran.com], what do you think?

Maybe we can open another JIRA to have a longer discussion about whether input 
streams should be thread-safe (I think they should).

> Documenting DFSClient#DFSInputStream expectations reading and preading in 
> concurrent context
> 
>
> Key: HDFS-6803
> URL: https://issues.apache.org/jira/browse/HDFS-6803
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 2.4.1
>Reporter: stack
> Attachments: 9117.md.txt, DocumentingDFSClientDFSInputStream (1).pdf, 
> DocumentingDFSClientDFSInputStream.v2.pdf, HDFS-6803v2.txt, HDFS-6803v3.txt, 
> fsdatainputstream.md.v3.html
>
>
> Reviews of the patch posted the parent task suggest that we be more explicit 
> about how DFSIS is expected to behave when being read by contending threads. 
> It is also suggested that presumptions made internally be made explicit 
> documenting expectations.
> Before we put up a patch we've made a document of assertions we'd like to 
> make into tenets of DFSInputSteam.  If agreement, we'll attach to this issue 
> a patch that weaves the assumptions into DFSIS as javadoc and class comments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6982) nntop: top­-like tool for name node users

2014-11-10 Thread Maysam Yabandeh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205240#comment-14205240
 ] 

Maysam Yabandeh commented on HDFS-6982:
---

[~andrew.wang] do you have more comments about the new patch?

> nntop: top­-like tool for name node users
> -
>
> Key: HDFS-6982
> URL: https://issues.apache.org/jira/browse/HDFS-6982
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Maysam Yabandeh
>Assignee: Maysam Yabandeh
> Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, 
> HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, 
> nntop-design-v1.pdf
>
>
> In this jira we motivate the need for nntop, a tool that, similarly to what 
> top does in Linux, lists the top users of the HDFS name node and gives 
> insight into which users are sending the majority of each traffic type to 
> the name node. This information turns out to be most critical when the 
> name node is under pressure and the HDFS admin needs to know which user is 
> hammering the name node and with what kind of requests. Here we present the 
> design of nntop, which has been in production at Twitter for the past 10 
> months. nntop proved to have low CPU overhead (< 2% in a cluster of 4K 
> nodes), a low memory footprint (less than a few MB), and to be quite 
> efficient on the write path (only two hash lookups to update a metric).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205241#comment-14205241
 ] 

Ming Ma commented on HDFS-7314:
---

Thanks, Colin. The reason to keep the thread running is to handle the following 
race condition:

1. The LeaseRenewer thread is aborting.
2. The application creates files before the LeaseRenewer is removed from the 
factory, so the DFSClient is added to that LeaseRenewer object.
3. The LeaseRenewer thread exits, so nobody will renew the lease for that 
DFSClient.

> Aborted DFSClient's impact on long running service like YARN
> 
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314.patch
>
>
> It happened in a YARN NodeManager scenario, but it could happen to any long 
> running service that uses a cached instance of DistributedFileSystem.
> 1. Active NN is under heavy load, so it became unavailable for 10 minutes; 
> any DFSClient request will get a ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations such as 
> the log aggregator or the shared cache in YARN-1492. The DFSClient used by 
> YARN NM's renewLease RPC got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects Aborted DFSClient and create a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.
> * After DFSClient gets into Aborted state, it doesn't have to reject all 
> requests , instead it can retry. If NN is available again it can transition 
> to healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7314) Aborted DFSClient's impact on long running service like YARN

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205204#comment-14205204
 ] 

Colin Patrick McCabe commented on HDFS-7314:


{code}
@@ -450,10 +455,11 @@ private void run(final int id) throws InterruptedException {
   + (elapsed/1000) + " seconds.  Aborting ...", ie);
   synchronized (this) {
 while (!dfsclients.isEmpty()) {
-  dfsclients.get(0).abort();
+  DFSClient dfsClient = dfsclients.get(0);
+  dfsClient.closeAllFilesBeingWritten(true);
+  closeClient(dfsClient);
 }
   }
-  break;
 } catch (IOException ie) {
   LOG.warn("Failed to renew lease for " + clientsString() + " for "
   + (elapsed/1000) + " seconds.  Will retry shortly ...", ie);
{code}
It seems like getting rid of "break" here is going to lead to the 
{{LeaseRenewer}} thread for the client continuing to run after the client's 
lease has been aborted.  This doesn't seem like what we want?  After all, we 
are going to create a new {{LeaseRenewer}} if the {{DFSClient}} opens another 
file for write.

> Aborted DFSClient's impact on long running service like YARN
> 
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314.patch
>
>
> It happened in a YARN NodeManager scenario, but it could happen to any long 
> running service that uses a cached instance of DistributedFileSystem.
> 1. Active NN is under heavy load, so it became unavailable for 10 minutes; 
> any DFSClient request will get a ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations such as 
> the log aggregator or the shared cache in YARN-1492. The DFSClient used by 
> YARN NM's renewLease RPC got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After DFSClient is in Aborted state, YARN NM can't use that cached 
> instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant to temporary NN unavailability. 
> Given the callstack is YARN -> DistributedFileSystem -> DFSClient, this can 
> be addressed at different layers.
> * YARN closes the DistributedFileSystem object when it receives some well 
> defined exception. Then the next HDFS call will create a new instance of 
> DistributedFileSystem. We have to fix all the places in YARN. Plus other HDFS 
> applications need to address this as well.
> * DistributedFileSystem detects Aborted DFSClient and create a new instance 
> of DFSClient. We will need to fix all the places DistributedFileSystem calls 
> DFSClient.
> * After DFSClient gets into Aborted state, it doesn't have to reject all 
> requests , instead it can retry. If NN is available again it can transition 
> to healthy state.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7312) Update DistCp v1 to optionally not use tmp location

2014-11-10 Thread Joseph Prosser (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Prosser updated HDFS-7312:
-
Attachment: HDFS-7312.001.patch

This patch is for branch-1 only.
Added a comment and re-uploaded with the correct username.

> Update DistCp v1 to optionally not use tmp location
> ---
>
> Key: HDFS-7312
> URL: https://issues.apache.org/jira/browse/HDFS-7312
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.5.1
>Reporter: Joseph Prosser
>Assignee: Joseph Prosser
>Priority: Minor
> Attachments: HDFS-7312.001.patch, HDFS-7312.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> DistCp v1 currently copies files to a tmp location and then renames that to 
> the specified destination.  This can cause performance issues on filesystems 
> such as S3.  A -skiptmp flag will be added to bypass this step and copy 
> directly to the destination.  This feature mirrors a similar one added to 
> HBase ExportSnapshot 
> [HBASE-9|https://issues.apache.org/jira/browse/HBASE-9]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205165#comment-14205165
 ] 

Colin Patrick McCabe commented on HDFS-7385:


Good find, [~jiangyu1211].  In the future you might want to consider keeping 
the description short and putting the details of how you found the bug, etc. in 
the comments.

It sounds like the problem here is that we are not calling {{op.setAclEntries}} 
in the case where there are no ACL entries?  So the previous ACL entries from 
the thread-local op get used.
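
For illustration, a small, self-contained sketch of the pitfall (invented 
names; not the real {{FSEditLog}}/{{MkdirOp}} API): a thread-locally cached op 
keeps whatever optional fields the previous call set unless they are explicitly 
reset.
{code}
// Hypothetical demonstration of stale fields on a ThreadLocal-cached op.
import java.util.Collections;
import java.util.List;

public class OpCachePitfall {
  static class MkdirOpSketch {
    List<String> aclEntries;        // survives from the previous use
    MkdirOpSketch reset() {         // the fix: clear optional fields whenever
      aclEntries = null;            // the op is handed out from the cache
      return this;
    }
  }

  private static final ThreadLocal<MkdirOpSketch> CACHE =
      ThreadLocal.withInitial(MkdirOpSketch::new);

  static MkdirOpSketch getInstance() {
    return CACHE.get().reset();     // without reset(), old ACLs leak through
  }

  public static void main(String[] args) {
    MkdirOpSketch first = getInstance();
    first.aclEntries = Collections.singletonList("user:alice:rwx");
    MkdirOpSketch second = getInstance();    // same object on the same thread
    System.out.println(second.aclEntries);   // null thanks to reset()
  }
}
{code}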

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>
>   We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original active NameNode to the new 
> active NameNode and started the new NameNode; then we changed the 
> configuration of all DataNodes and restarted them, so they block-reported to 
> the new NameNodes at once and sent heartbeats after that.
>    Everything seemed perfect, but after we restarted the ResourceManager, 
> most of the users complained that their jobs couldn't be executed because of 
> permission problems.
>   We use ACLs in our clusters, and after the migration we found that most of 
> the directories and files which had no ACLs set before now carried ACL 
> entries. That is why users could not execute their jobs, so we had to change 
> most file permissions to a+r and directory permissions to a+rx to make sure 
> the jobs could run.
> After investigating this problem for some days, I found a bug in 
> FSEditLog.java: the ThreadLocal op cache in FSEditLog is not set to the 
> proper value in the logMkDir and logOpenFile functions. Here is the code of 
> logMkDir:
>   public void logMkDir(String path, INode newNode) {
> PermissionStatus permissions = newNode.getPermissionStatus();
> MkdirOp op = MkdirOp.getInstance(cache.get())
>   .setInodeId(newNode.getId())
>   .setPath(path)
>   .setTimestamp(newNode.getModificationTime())
>   .setPermissionStatus(permissions);
> AclFeature f = newNode.getAclFeature();
> if (f != null) {
>   op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
> }
> logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the ACL entries on the op taken from the cache. After that, if 
> we mkdir without any ACLs through the same handler, the ACL entries on the 
> cached op are still the ones set by the previous call, and because the 
> newNode has no AclFeature we never get a chance to change them. The edit log 
> is then wrong and records the wrong ACLs. After the standby NameNode loads 
> the edit logs from the JournalNodes, applies them to memory, saves the 
> namespace, and transfers the wrong fsimage to the ANN, all the fsimages end 
> up wrong. The only solution is to save the namespace from the ANN, which 
> produces a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7382) DataNode in secure mode may throw NullPointerException if client connects before DataNode registers itself with NameNode.

2014-11-10 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205149#comment-14205149
 ] 

Yongjun Zhang commented on HDFS-7382:
-

Hi [~cnauroth], thanks for addressing my earlier comments, and for the 
suggestion. I've created HDFS-7386 as a follow-up.




> DataNode in secure mode may throw NullPointerException if client connects 
> before DataNode registers itself with NameNode.
> -
>
> Key: HDFS-7382
> URL: https://issues.apache.org/jira/browse/HDFS-7382
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, security
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
> Fix For: 2.6.0
>
> Attachments: HDFS-7382.1.patch, HDFS-7382.2.patch
>
>
> {{SaslDataTransferServer#receive}} needs to check if the DataNode is 
> listening on a privileged port.  It does this by checking the address from 
> the {{DatanodeID}}.  However, there is a window of time when this will be 
> {{null}}.  If a client is still holding a {{LocatedBlock}} that references 
> that DataNode and chooses to connect, then there is a risk of getting a 
> {{NullPointerException}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method

2014-11-10 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HDFS-7386:
---

 Summary: Replace check "port number < 1024" with shared 
isPrivilegedPort method 
 Key: HDFS-7386
 URL: https://issues.apache.org/jira/browse/HDFS-7386
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang


Per the discussion in HDFS-7382, I'm filing this jira as a follow-up to replace 
the "port number < 1024" check with a shared isPrivilegedPort method.

Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205143#comment-14205143
 ] 

Hadoop QA commented on HDFS-7146:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12680610/HDFS-7146.004.patch
  against trunk revision ab30d51.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-nfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8705//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8705//console

This message is automatically generated.

> NFS ID/Group lookup requires SSSD enumeration on the server
> ---
>
> Key: HDFS-7146
> URL: https://issues.apache.org/jira/browse/HDFS-7146
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
> HDFS-7146.003.patch, HDFS-7146.004.patch
>
>
> The current implementation of the NFS UID and GID lookup works by running 
> 'getent passwd' with an assumption that it will return the entire list of 
> users available on the OS, local and remote (AD/etc.).
> Administrators in most secure setups are advised to prevent, and do prevent, 
> this behaviour of the command, to avoid excessive load on the ADs involved, 
> as the # of users to be listed may be too large and the repeated requests for 
> ALL users not present in the cache would be too much for the AD 
> infrastructure to bear.
> The NFS server should likely do lookups based on a specific UID request, via 
> 'getent passwd ', if the UID does not match a cached value. This reduces 
> load on the LDAP backed infrastructure.
> Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205138#comment-14205138
 ] 

Colin Patrick McCabe commented on HDFS-4882:


If I understand this correctly, the issue here is that there is an expired 
lease which is not getting removed from {{sortedLeases}}, causing 
{{LeaseManager#checkLeases}} to loop forever.  Maybe this is a dumb question, 
but shouldn't we fix the code so that this expired lease does get removed?  Is 
this patch really fixing the root issue?  It seems like if we let expired 
leases stick around forever we may have other problems.

Also, this patch seems to replace a {{SortedSet}} with a 
{{ConcurrentSkipListSet}}.  We don't need the overhead of a concurrent set 
here... the set is only modified while holding the lock.  If you want to modify 
while iterating, you can simply use an {{Iterator}} for this purpose.  Or, 
since the set is sorted, you can use {{SortedSet#tailSet}} to find the element 
after the previous element you were looking at.
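
A quick, hedged sketch of those two alternatives with a plain {{TreeSet}} 
(standalone example, not the {{LeaseManager}} code):
{code}
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.TreeSet;

public class LeaseIterationSketch {
  public static void main(String[] args) {
    NavigableSet<String> leases = new TreeSet<>();
    leases.add("lease-a");
    leases.add("lease-b");
    leases.add("lease-c");

    // Option 1: remove entries while iterating, via the Iterator.
    for (Iterator<String> it = leases.iterator(); it.hasNext(); ) {
      if (it.next().equals("lease-b")) {
        it.remove();                              // safe structural change
      }
    }

    // Option 2: resume strictly after the last element already processed.
    String previous = "lease-a";
    NavigableSet<String> remaining = leases.tailSet(previous, false);
    System.out.println(remaining);                // prints [lease-c]
  }
}
{code}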

> Namenode LeaseManager checkLeases() runs into infinite loop
> ---
>
> Key: HDFS-4882
> URL: https://issues.apache.org/jira/browse/HDFS-4882
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client, namenode
>Affects Versions: 2.0.0-alpha, 2.5.1
>Reporter: Zesheng Wu
>Assignee: Ravi Prakash
>Priority: Critical
> Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
> HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of the first block are successfully acked and the client 
> sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 are down
> 6. client recovers the pipeline, but no new DN is added to the pipeline 
> because the current pipeline stage is PIPELINE_CLOSE
> 7. client continues writing the last block, and tries to close the file after 
> writing all the data
> 8. NN finds that the penultimate block doesn't have enough replicas (our 
> dfs.namenode.replication.min=2), and the client's close runs into an infinite 
> loop (HDFS-2936); at the same time, NN sets the last block's state to 
> COMPLETE
> 9. shut down the client
> 10. the file's lease exceeds the hard limit
> 11. LeaseManager realizes that and begins to do lease recovery by calling 
> fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers lease manager's 
> infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
>  limit
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
>  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
> /user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
> blk_-7028017402720175688_1202597,
> lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
> for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
> APREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd line log is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7056) Snapshot support for truncate

2014-11-10 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-7056:
--
Attachment: HDFS-7056.patch

This patch implements diskspaceConsumed() and computeContentSummary() so that 
they take into account blocks both in the file and in its snapshots.
As Plamen stated, with truncate diskspaceConsumed() cannot just count the bytes 
of the current file. The new implementation counts all bytes that are stored on 
DataNodes as blocks of the file or of its snapshots, and the byte count is 
multiplied by the replication factor, as in the current code.
Note that if truncate is not used, the space consumed will be the same as 
today.
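
As a hedged, numeric illustration of that accounting (values invented): count 
every block still held on DataNodes for either the current file or a snapshot, 
then multiply by the replication factor.
{code}
public class TruncateSpaceExample {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;  // 128 MB blocks
    long currentFileBlocks = 2;           // file truncated down to 2 blocks
    long snapshotOnlyBlocks = 1;          // truncated block kept by a snapshot
    short replication = 3;

    long consumed =
        (currentFileBlocks + snapshotOnlyBlocks) * blockSize * replication;
    System.out.println("diskspaceConsumed = " + consumed + " bytes");
  }
}
{code}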

> Snapshot support for truncate
> -
>
> Key: HDFS-7056
> URL: https://issues.apache.org/jira/browse/HDFS-7056
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Konstantin Shvachko
>Assignee: Plamen Jeliazkov
> Attachments: HDFS-3107-HDFS-7056-combined.patch, HDFS-7056.patch, 
> HDFS-7056.patch, HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx
>
>
> Implementation of truncate in HDFS-3107 does not allow truncating files which 
> are in a snapshot. It is desirable to be able to truncate and still keep the 
> old file state of the file in the snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7328) TestTraceAdmin assumes Unix line endings.

2014-11-10 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205128#comment-14205128
 ] 

Chris Nauroth commented on HDFS-7328:
-

No problem, [~cmccabe].

bq. It seems like Java chose to do things differently and as a consequence we 
probably have a lot of these cases.

Don't you just love consistency?  :-)

There have been a lot of these cases throughout the past 2 years, but I think 
all of them have been cleaned up at this point.  If we had Windows Jenkins, 
then it would help catch introduction of new occurrences.  I hope to revisit 
this topic on the Apache infra side in the next several months.

> TestTraceAdmin assumes Unix line endings.
> -
>
> Key: HDFS-7328
> URL: https://issues.apache.org/jira/browse/HDFS-7328
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Fix For: 2.6.0
>
> Attachments: HDFS-7328.1.patch
>
>
> {{TestTraceAdmin}} contains some string assertions that assume Unix line 
> endings.  The test fails on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7328) TestTraceAdmin assumes Unix line endings.

2014-11-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205112#comment-14205112
 ] 

Colin Patrick McCabe commented on HDFS-7328:


Thanks for fixing this, [~cnauroth].  In C and C++, {{"\n"}} is expanded into 
the platform-specific newline sequence.  See 
http://stackoverflow.com/questions/6891252/c-newline-character-under-windows-command-line-redirection
 .  It seems like Java chose to do things differently and as a consequence we 
probably have a lot of these cases.
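
As a hedged illustration (not the actual TestTraceAdmin code), building 
expected strings with {{System.lineSeparator()}} keeps such assertions 
platform-independent:
{code}
public class NewlineExample {
  public static void main(String[] args) {
    String eol = System.lineSeparator();  // "\n" on Unix, "\r\n" on Windows
    String produced = "first line" + eol + "second line" + eol;
    String expected = "first line" + eol + "second line" + eol;
    // In a JUnit test this would be assertEquals(expected, produced).
    System.out.println(expected.equals(produced));  // true on either platform
  }
}
{code}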

> TestTraceAdmin assumes Unix line endings.
> -
>
> Key: HDFS-7328
> URL: https://issues.apache.org/jira/browse/HDFS-7328
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Trivial
> Fix For: 2.6.0
>
> Attachments: HDFS-7328.1.patch
>
>
> {{TestTraceAdmin}} contains some string assertions that assume Unix line 
> endings.  The test fails on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-10 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205080#comment-14205080
 ] 

Yongjun Zhang commented on HDFS-7146:
-

Hi Guys,

Sorry for getting back late. I just uploaded a patch on top of HADOOP-11195. 
I'd appreciate it if you could help review it when you have time.

A recap of the solution:

# At initialization, the maps are empty.
# Users, groups, and IDs are added to the maps on demand (i.e. when requested); 
see the sketch after this list.
# When a groupId is requested for a given groupName, and the groupName is 
numerical, the full group map is loaded (this is the lazy full-list load I 
referred to earlier).
# Periodically refresh the cached maps for both users and groups. What I do 
here is actually clear (reinitialize) the maps, since I imagine some users and 
groups might be removed (for example, a user changes jobs, so their entries 
need to be removed).
# Steps 2 and 3 are then repeated.
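
Here is the sketch referenced in step 2: a hedged, simplified illustration of 
the on-demand cache with periodic clearing (invented class and method names, 
not the actual NFS id-mapping code):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class OnDemandIdCacheSketch {
  private final Map<Integer, String> uidToUser = new ConcurrentHashMap<>();
  private final long refreshIntervalMs;
  private volatile long lastRefreshMs = System.currentTimeMillis();

  OnDemandIdCacheSketch(long refreshIntervalMs) {
    this.refreshIntervalMs = refreshIntervalMs;
  }

  String getUserName(int uid) {
    maybeClear();                                   // step 4: periodic reinit
    return uidToUser.computeIfAbsent(uid, this::lookupSingleUser);  // step 2
  }

  private void maybeClear() {
    long now = System.currentTimeMillis();
    if (now - lastRefreshMs > refreshIntervalMs) {
      uidToUser.clear();                            // drops removed users too
      lastRefreshMs = now;
    }
  }

  private String lookupSingleUser(int uid) {
    // Stand-in for a per-UID lookup; hard-coded to keep the sketch
    // self-contained.
    return "user-" + uid;
  }
}
{code}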

BTW, because we now update the maps incrementally, there tend to be a lot of 
messages like
{quote}
LOG.info("Updated " + mapName + " map size: " + map.size());
{quote}
so I took the liberty of changing it to a debug message in the patch.

Thanks.



> NFS ID/Group lookup requires SSSD enumeration on the server
> ---
>
> Key: HDFS-7146
> URL: https://issues.apache.org/jira/browse/HDFS-7146
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
> HDFS-7146.003.patch, HDFS-7146.004.patch
>
>
> The current implementation of the NFS UID and GID lookup works by running 
> 'getent passwd' with an assumption that it will return the entire list of 
> users available on the OS, local and remote (AD/etc.).
> Administrators in most secure setups are advised to prevent, and do prevent, 
> this behaviour of the command, to avoid excessive load on the ADs involved, 
> as the # of users to be listed may be too large and the repeated requests for 
> ALL users not present in the cache would be too much for the AD 
> infrastructure to bear.
> The NFS server should likely do lookups based on a specific UID request, via 
> 'getent passwd ', if the UID does not match a cached value. This reduces 
> load on the LDAP backed infrastructure.
> Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-10 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7146:

Attachment: HDFS-7146.004.patch

> NFS ID/Group lookup requires SSSD enumeration on the server
> ---
>
> Key: HDFS-7146
> URL: https://issues.apache.org/jira/browse/HDFS-7146
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nfs
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
> HDFS-7146.003.patch, HDFS-7146.004.patch
>
>
> The current implementation of the NFS UID and GID lookup works by running 
> 'getent passwd' with an assumption that it will return the entire list of 
> users available on the OS, local and remote (AD/etc.).
> Administrators in most secure setups are advised to prevent, and do prevent, 
> this behaviour of the command, to avoid excessive load on the ADs involved, 
> as the # of users to be listed may be too large and the repeated requests for 
> ALL users not present in the cache would be too much for the AD 
> infrastructure to bear.
> The NFS server should likely do lookups based on a specific UID request, via 
> 'getent passwd ', if the UID does not match a cached value. This reduces 
> load on the LDAP backed infrastructure.
> Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7312) Update DistCp v1 to optionally not use tmp location

2014-11-10 Thread Joseph Prosser (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Prosser updated HDFS-7312:
-
Attachment: HDFS-7312.patch

This patch is for branch-1 only.


> Update DistCp v1 to optionally not use tmp location
> ---
>
> Key: HDFS-7312
> URL: https://issues.apache.org/jira/browse/HDFS-7312
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.5.1
>Reporter: Joseph Prosser
>Assignee: Joseph Prosser
>Priority: Minor
> Attachments: HDFS-7312.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> DistCp v1 currently copies files to a tmp location and then renames that to 
> the specified destination.  This can cause performance issues on filesystems 
> such as S3.  A -skiptmp flag will be added to bypass this step and copy 
> directly to the destination.  This feature mirrors a similar one added to 
> HBase ExportSnapshot 
> [HBASE-9|https://issues.apache.org/jira/browse/HBASE-9]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204987#comment-14204987
 ] 

stack commented on HDFS-7358:
-

In HBase we use a 16k dfs.bytes-per-checksum. Thanks for filing the 64k issue; 
I was going to ask about it. Instead, let me dig in and try to add findings to 
the new issue.

I tried the patch on my little rig and it works; no more getting stuck.

Do we need all this new synchronization on Packet? Any chance of instead 
tracing to figure out where a Packet might be referenced (so we'd call 
writeData on the released internal buf) after it's been closed? (Is this how 
you fixed TestHFlush?)

nit: rename setClosed to close if you end up making a new patch.



> Clients may get stuck waiting when using ByteArrayManager
> -
>
> Key: HDFS-7358
> URL: https://issues.apache.org/jira/browse/HDFS-7358
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
> h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
> h7358_20141108.patch
>
>
> [~stack] reported that clients might get stuck waiting when using 
> ByteArrayManager; see [his 
> comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7362) Proxy user refresh won't modify or remove existing groups or hosts from super user list

2014-11-10 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne resolved HDFS-7362.
--
Resolution: Duplicate

This is not exactly a dup of HADOOP-10817, but changes made for HADOOP-10817 
fixed this issue.


> Proxy user refresh won't modify or remove existing groups or hosts from super 
> user list
> ---
>
> Key: HDFS-7362
> URL: https://issues.apache.org/jira/browse/HDFS-7362
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.5.0
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> 2.x added a new DefaultImpersonationProvider class for reading the superuser
> configuration. In this class, once the host and group properties for a 
> proxyuser are defined, they cannot be removed or modified without bouncing 
> the daemon.
> As long as the config is updated correctly the first time, this problem won't 
> manifest itself. Once defined, these properties don't tend to change. 
> However, if the properties are mis-entered the first time, restarting the 
> NN/RM/JHS/etc will be necessary to correctly re-read the config. An admin 
> refresh won't do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7383) DataNode.requestShortCircuitFdsForRead may throw NullPointerException

2014-11-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204871#comment-14204871
 ] 

Hudson commented on HDFS-7383:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1953 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1953/])
HDFS-7383. DataNode.requestShortCircuitFdsForRead may throw 
NullPointerException. Contributed by Tsz Wo Nicholas Sze. (sureshms: rev 
4ddc5cad0a4175f7f5ef9504a7365601dc7e63b4)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetCache.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DatanodeUtil.java
HDFS-7383. Merged to branch-2.6 also. (acmurthy: rev 
f62ec31739cc15097107655c6c8265b5d3625817)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> DataNode.requestShortCircuitFdsForRead may throw NullPointerException
> -
>
> Key: HDFS-7383
> URL: https://issues.apache.org/jira/browse/HDFS-7383
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Fix For: 2.6.0
>
> Attachments: h7383_20141108.patch
>
>
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.requestShortCircuitFdsForRead(DataNode.java:1525)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitFds(DataXceiver.java:286)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitFds(Receiver.java:185)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:89)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7383) DataNode.requestShortCircuitFdsForRead may throw NullPointerException

2014-11-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204813#comment-14204813
 ] 

Hudson commented on HDFS-7383:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1929 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1929/])
HDFS-7383. DataNode.requestShortCircuitFdsForRead may throw 
NullPointerException. Contributed by Tsz Wo Nicholas Sze. (sureshms: rev 
4ddc5cad0a4175f7f5ef9504a7365601dc7e63b4)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetCache.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DatanodeUtil.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
HDFS-7383. Merged to branch-2.6 also. (acmurthy: rev 
f62ec31739cc15097107655c6c8265b5d3625817)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> DataNode.requestShortCircuitFdsForRead may throw NullPointerException
> -
>
> Key: HDFS-7383
> URL: https://issues.apache.org/jira/browse/HDFS-7383
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Fix For: 2.6.0
>
> Attachments: h7383_20141108.patch
>
>
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.requestShortCircuitFdsForRead(DataNode.java:1525)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitFds(DataXceiver.java:286)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitFds(Receiver.java:185)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:89)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7383) DataNode.requestShortCircuitFdsForRead may throw NullPointerException

2014-11-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204685#comment-14204685
 ] 

Hudson commented on HDFS-7383:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #739 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/739/])
HDFS-7383. DataNode.requestShortCircuitFdsForRead may throw 
NullPointerException. Contributed by Tsz Wo Nicholas Sze. (sureshms: rev 
4ddc5cad0a4175f7f5ef9504a7365601dc7e63b4)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetCache.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DatanodeUtil.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
HDFS-7383. Merged to branch-2.6 also. (acmurthy: rev 
f62ec31739cc15097107655c6c8265b5d3625817)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> DataNode.requestShortCircuitFdsForRead may throw NullPointerException
> -
>
> Key: HDFS-7383
> URL: https://issues.apache.org/jira/browse/HDFS-7383
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Fix For: 2.6.0
>
> Attachments: h7383_20141108.patch
>
>
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.requestShortCircuitFdsForRead(DataNode.java:1525)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitFds(DataXceiver.java:286)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitFds(Receiver.java:185)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:89)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-10 Thread jiangyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiangyu updated HDFS-7385:
--
Target Version/s: 2.5.0, 2.4.0  (was: 2.4.0, 2.5.0)
  Status: Patch Available  (was: Open)

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.0, 2.4.0
>Reporter: jiangyu
>Assignee: jiangyu
>
>   We migrated our NameNodes from lower-spec to higher-spec machines last 
> week. First, we copied the current directory (fsimage and edit log files) 
> from the original active NameNode to the new active NameNode and started 
> the new NameNode; then we changed the configuration of all DataNodes and 
> restarted them, so they sent block reports to the new NameNodes immediately 
> and heartbeats after that.
>   Everything seemed fine, but after we restarted the ResourceManager, most 
> users complained that their jobs could not run because of permission 
> problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files which had no ACLs before now carried ACL 
> entries. That is why users could not run their jobs, so we had to change 
> most file permissions to a+r and directory permissions to a+rx to let the 
> jobs run.
>   After investigating for some days, I found a bug in FSEditLog.java: the 
> ThreadLocal op cache in FSEditLog does not set the proper value in the 
> logMkDir and logOpenFile methods. Here is the code of logMkDir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (that is, one 
> thread), the ACL entries are set on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the op from the cache 
> still holds the ACL entries from the previous call, and because the new 
> inode has no AclFeature there is no code path that clears them. The edit 
> log therefore records the wrong ACLs. Once the standby NameNode loads those 
> edit logs from the JournalNodes, applies them in memory, saves the 
> namespace, and transfers the resulting fsimage to the active NameNode, all 
> fsimages are wrong. The only way to get a correct fsimage is to save the 
> namespace from the active NameNode.
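
To make the reuse hazard concrete, here is a minimal, self-contained sketch (class and field names are hypothetical stand-ins, not the actual FSEditLog types, and this is not the attached patch) showing how a per-thread cached op keeps stale ACL entries unless the field is reset before the conditional set:

{noformat}
import java.util.ArrayList;
import java.util.List;

public class ReusedOpSketch {

    // Hypothetical stand-in for an edit-log op object cached per thread and
    // reused across calls, in the spirit of MkdirOp.getInstance(cache.get()).
    static class MkdirOpLike {
        List<String> aclEntries;   // may be left over from the previous use
        String path;
    }

    static final ThreadLocal<MkdirOpLike> CACHE =
        ThreadLocal.withInitial(MkdirOpLike::new);

    // Buggy shape: only sets aclEntries when the new inode has ACLs, so a
    // previous call's entries leak into this op.
    static MkdirOpLike logMkdirBuggy(String path, List<String> acls) {
        MkdirOpLike op = CACHE.get();
        op.path = path;
        if (acls != null) {
            op.aclEntries = acls;
        }
        return op;
    }

    // Safer shape: always reset the reused field before conditionally setting it.
    static MkdirOpLike logMkdirFixed(String path, List<String> acls) {
        MkdirOpLike op = CACHE.get();
        op.path = path;
        op.aclEntries = null;          // clear stale state from the last use
        if (acls != null) {
            op.aclEntries = acls;
        }
        return op;
    }

    public static void main(String[] args) {
        List<String> acls = new ArrayList<>();
        acls.add("user:userA:rwx");

        logMkdirBuggy("/withAcl", acls);
        System.out.println("buggy, no ACLs requested: "
            + logMkdirBuggy("/plain", null).aclEntries);   // stale entries leak

        logMkdirFixed("/withAcl", acls);
        System.out.println("fixed, no ACLs requested: "
            + logMkdirFixed("/plain", null).aclEntries);   // null, as expected
    }
}
{noformat}

The same reset-before-conditional-set shape would apply to any field of a reused op that is only assigned under a condition.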



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-10 Thread jiangyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiangyu updated HDFS-7385:
--
Target Version/s: 2.5.0, 2.4.0  (was: 2.4.0, 2.5.0)
  Status: Open  (was: Patch Available)

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.0, 2.4.0
>Reporter: jiangyu
>Assignee: jiangyu
>
>   We migrated our NameNodes from lower-spec to higher-spec machines last 
> week. First, we copied the current directory (fsimage and edit log files) 
> from the original active NameNode to the new active NameNode and started 
> the new NameNode; then we changed the configuration of all DataNodes and 
> restarted them, so they sent block reports to the new NameNodes immediately 
> and heartbeats after that.
>   Everything seemed fine, but after we restarted the ResourceManager, most 
> users complained that their jobs could not run because of permission 
> problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files which had no ACLs before now carried ACL 
> entries. That is why users could not run their jobs, so we had to change 
> most file permissions to a+r and directory permissions to a+rx to let the 
> jobs run.
>   After investigating for some days, I found a bug in FSEditLog.java: the 
> ThreadLocal op cache in FSEditLog does not set the proper value in the 
> logMkDir and logOpenFile methods. Here is the code of logMkDir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (that is, one 
> thread), the ACL entries are set on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the op from the cache 
> still holds the ACL entries from the previous call, and because the new 
> inode has no AclFeature there is no code path that clears them. The edit 
> log therefore records the wrong ACLs. Once the standby NameNode loads those 
> edit logs from the JournalNodes, applies them in memory, saves the 
> namespace, and transfers the resulting fsimage to the active NameNode, all 
> fsimages are wrong. The only way to get a correct fsimage is to save the 
> namespace from the active NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7385) ThreadLocal used in FSEditLog class lead FSImage permission mess up

2014-11-10 Thread jiangyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiangyu reassigned HDFS-7385:
-

Assignee: jiangyu

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> 
>
> Key: HDFS-7385
> URL: https://issues.apache.org/jira/browse/HDFS-7385
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.0, 2.5.0
>Reporter: jiangyu
>Assignee: jiangyu
>
>   We migrated our NameNodes from lower-spec to higher-spec machines last 
> week. First, we copied the current directory (fsimage and edit log files) 
> from the original active NameNode to the new active NameNode and started 
> the new NameNode; then we changed the configuration of all DataNodes and 
> restarted them, so they sent block reports to the new NameNodes immediately 
> and heartbeats after that.
>   Everything seemed fine, but after we restarted the ResourceManager, most 
> users complained that their jobs could not run because of permission 
> problems.
>   We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files which had no ACLs before now carried ACL 
> entries. That is why users could not run their jobs, so we had to change 
> most file permissions to a+r and directory permissions to a+rx to let the 
> jobs run.
>   After investigating for some days, I found a bug in FSEditLog.java: the 
> ThreadLocal op cache in FSEditLog does not set the proper value in the 
> logMkDir and logOpenFile methods. Here is the code of logMkDir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>   For example, if we mkdir with ACLs through one handler (that is, one 
> thread), the ACL entries are set on the op taken from the cache. If we then 
> mkdir without any ACLs through the same handler, the op from the cache 
> still holds the ACL entries from the previous call, and because the new 
> inode has no AclFeature there is no code path that clears them. The edit 
> log therefore records the wrong ACLs. Once the standby NameNode loads those 
> edit logs from the JournalNodes, applies them in memory, saves the 
> namespace, and transfers the resulting fsimage to the active NameNode, all 
> fsimages are wrong. The only way to get a correct fsimage is to save the 
> namespace from the active NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)