[jira] [Commented] (HDFS-4376) Intermittent timeout of TestBalancerWithNodeGroup

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796500#comment-13796500
 ] 

Hadoop QA commented on HDFS-4376:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608660/HDFS-4376-v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5207//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5207//console

This message is automatically generated.

 Intermittent timeout of TestBalancerWithNodeGroup
 -

 Key: HDFS-4376
 URL: https://issues.apache.org/jira/browse/HDFS-4376
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer, test
Affects Versions: 2.0.3-alpha
Reporter: Aaron T. Myers
Assignee: Junping Du
 Attachments: BalancerTest-HDFS-4376-v1.tar.gz, HDFS-4376-v1.patch, 
 HDFS-4376-v2.patch, HDFS-4376-v3.patch, 
 test-balancer-with-node-group-timeout.txt


 HDFS-4261 fixed several issues with the balancer and balancer tests, and 
 reduced the frequency with which TestBalancerWithNodeGroup times out. Despite 
 this, occasional timeouts still occur in this test. This JIRA is to track and 
 fix this problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5042) Completed files lost after power failure

2013-10-16 Thread Luke Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796509#comment-13796509
 ] 

Luke Lu commented on HDFS-5042:
---

Looks like this is the most compelling reason to use XFS, where *all* 
transactions prior to the fsync() triggered log force are guaranteed to be on
disk once the fsync completes. There are no plans to change this behavior, 
either, because we rely on this architectural characteristic to provide strong 
ordering of metadata operations in many places.

 Completed files lost after power failure
 

 Key: HDFS-5042
 URL: https://issues.apache.org/jira/browse/HDFS-5042
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5)
Reporter: Dave Latham
Priority: Critical

 We suffered a cluster wide power failure after which HDFS lost data that it 
 had acknowledged as closed and complete.
 The client was HBase which compacted a set of HFiles into a new HFile, then 
 after closing the file successfully, deleted the previous versions of the 
 file.  The cluster then lost power, and when brought back up the newly 
 created file was marked CORRUPT.
 Based on reading the logs it looks like the replicas were created by the 
 DataNodes in the 'blocksBeingWritten' directory.  Then when the file was 
 closed they were moved to the 'current' directory.  After the power cycle 
 those replicas were again in the blocksBeingWritten directory of the 
 underlying file system (ext3).  When those DataNodes reported in to the 
 NameNode it deleted those replicas and lost the file.
 Some possible fixes could be having the DataNode fsync the directory(s) after 
 moving the block from blocksBeingWritten to current to ensure the rename is 
 durable or having the NameNode accept replicas from blocksBeingWritten under 
 certain circumstances.
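 (As a rough illustration of the first suggested fix -- fsync'ing the parent 
 directory after the rename so the new directory entry itself survives a power 
 loss -- a minimal sketch follows. This is not the DataNode's actual code, and 
 opening a directory channel for fsync like this is Linux-specific.)
{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class DurableRename {
  /** Move src to dst, then fsync dst's parent directory so the rename is durable. */
  public static void renameDurably(Path src, Path dst) throws IOException {
    Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
    // fsync the directory so the new directory entry (not just the file data)
    // is on disk before success is reported; may throw on non-Linux platforms
    try (FileChannel dir = FileChannel.open(dst.getParent(), StandardOpenOption.READ)) {
      dir.force(true);
    }
  }
}
{code}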
 Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode):
 {noformat}
 RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: 
 Creating 
 file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
  with permission=rwxrwxrwx
 NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
 NameSystem.allocateBlock: 
 /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
  blk_1395839728632046111_357084589
 DN 2013-06-29 11:16:06,832 INFO 
 org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block 
 blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: 
 /10.0.5.237:50010
 NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
 NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to 
 blk_1395839728632046111_357084589 size 25418340
 NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
 NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to 
 blk_1395839728632046111_357084589 size 25418340
 NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
 NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to 
 blk_1395839728632046111_357084589 size 25418340
 DN 2013-06-29 11:16:11,385 INFO 
 org.apache.hadoop.hdfs.server.datanode.DataNode: Received block 
 blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
 DN 2013-06-29 11:16:11,385 INFO 
 org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block 
 blk_1395839728632046111_357084589 terminating
 NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing 
 lease on  file 
 /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
  from client DFSClient_hb_rs_hs745,60020,1372470111932
 NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* 
 NameSystem.completeFile: file 
 /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
  is closed by DFSClient_hb_rs_hs745,60020,1372470111932
 RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Renaming compacted file at 
 hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
  to 
 hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
 RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store: 
 Completed major compaction of 7 file(s) in n of 
 users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into 
 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
 ---  CRASH, RESTART -
 NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
 NameSystem.addStoredBlock: addStoredBlock request received for 
 blk_1395839728632046111_357084589 on 

[jira] [Created] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)
zhaoyunjiong created HDFS-5367:
--

 Summary: Restore fsimage locked NameNode too long when the size of 
fsimage are big
 Key: HDFS-5367
 URL: https://issues.apache.org/jira/browse/HDFS-5367
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong


Our cluster has a 40G fsimage, and we write one copy of the edit log to NFS.
After the NFS mount temporarily failed, the NameNode tried to recover it during 
the next checkpoint by saving the 40G fsimage to NFS. That takes some time 
(40G / 128MB/s = 320 seconds), during which FSNamesystem stays locked, and this 
brought down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5367:
---

Attachment: (was: HDFS-5367)

 Restore fsimage locked NameNode too long when the size of fsimage are big
 -

 Key: HDFS-5367
 URL: https://issues.apache.org/jira/browse/HDFS-5367
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong

 Our cluster has a 40G fsimage, and we write one copy of the edit log to NFS.
 After the NFS mount temporarily failed, the NameNode tried to recover it during 
 the next checkpoint by saving the 40G fsimage to NFS. That takes some time 
 (40G / 128MB/s = 320 seconds), during which FSNamesystem stays locked, and this 
 brought down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5367:
---

Attachment: HDFS-5367

The fsimage restored when the SecondaryNameNode calls rollEditLog will soon be 
replaced when the SecondaryNameNode calls rollFsImage.
So I think restoring the fsimage is not necessary.

 Restore fsimage locked NameNode too long when the size of fsimage are big
 -

 Key: HDFS-5367
 URL: https://issues.apache.org/jira/browse/HDFS-5367
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong

 Our cluster has a 40G fsimage, and we write one copy of the edit log to NFS.
 After the NFS mount temporarily failed, the NameNode tried to recover it during 
 the next checkpoint by saving the 40G fsimage to NFS. That takes some time 
 (40G / 128MB/s = 320 seconds), during which FSNamesystem stays locked, and this 
 brought down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread zhaoyunjiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated HDFS-5367:
---

Attachment: HDFS-5367-branch-1.2.patch

This patch avoids restoring the fsimage so that rollEditLog finishes as soon as 
possible.

 Restore fsimage locked NameNode too long when the size of fsimage are big
 -

 Key: HDFS-5367
 URL: https://issues.apache.org/jira/browse/HDFS-5367
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
 Attachments: HDFS-5367-branch-1.2.patch


 Our cluster has a 40G fsimage, and we write one copy of the edit log to NFS.
 After the NFS mount temporarily failed, the NameNode tried to recover it during 
 the next checkpoint by saving the 40G fsimage to NFS. That takes some time 
 (40G / 128MB/s = 320 seconds), during which FSNamesystem stays locked, and this 
 brought down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5283) NN not coming out of startup safemode due to under construction blocks only inside snapshots also counted in safemode threshhold

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5283:


Attachment: HDFS-5283.patch

Updated the patch with comments.
Please review.

 NN not coming out of startup safemode due to under construction blocks only 
 inside snapshots also counted in safemode threshhold
 

 Key: HDFS-5283
 URL: https://issues.apache.org/jira/browse/HDFS-5283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5283.000.patch, HDFS-5283.patch, HDFS-5283.patch, 
 HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch


 This was observed in one of our environments:
 1. An MR job was running which had created some temporary files and was 
 writing to them.
 2. A snapshot was taken.
 3. The job was killed and the temporary files were deleted.
 4. The Namenode was restarted.
 5. After the restart the Namenode stayed in safemode waiting for blocks.
 Analysis
 -
 1. The snapshot taken also includes the temporary files which were open, and 
 the original files were later deleted.
 2. The under-construction (UC) block count was taken from leases; UC blocks 
 that exist only inside snapshots were not considered.
 3. So the safemode threshold count was too high and the NN did not come out of 
 safemode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5368:


Description: 
Namenode entered safemode during restart.

1. After the restart the NN entered the safemode extension period.
2. During this time a deadlock occurred between the datanode heartbeat handling 
and the SafeModeMonitor thread.

{noformat}
Found one Java-level deadlock:
=
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
  waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
  which is held by IPC Server handler 2 on 62212
IPC Server handler 2 on 62212:
  waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
  which is held by 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953 
{noformat}

Check attached jstack for complete stack

  was:
Namenode entered safemode during restart.

1. After the restart the NN entered the safemode extension period.
2. During this time a deadlock occurred between the datanode heartbeat handling 
and the SafeModeMonitor thread.

{noformat}Found one Java-level deadlock:
=
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
  waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
  which is held by IPC Server handler 2 on 62212
IPC Server handler 2 on 62212:
  waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
  which is held by 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953{noformat}

Check attached jstack for complete stack


 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinay
Assignee: Vinay
Priority: Blocker

 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 {noformat} Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953 
 {noformat}
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5368:


Description: 
Namenode entered safemode during restart.

1. After the restart the NN entered the safemode extension period.
2. During this time a deadlock occurred between the datanode heartbeat handling 
and the SafeModeMonitor thread.

Found one Java-level deadlock:
=
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
  waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
  which is held by IPC Server handler 2 on 62212
IPC Server handler 2 on 62212:
  waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
  which is held by 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953

Check attached jstack for complete stack

  was:
Namenode entered safemode during restart.

1. After the restart the NN entered the safemode extension period.
2. During this time a deadlock occurred between the datanode heartbeat handling 
and the SafeModeMonitor thread.

{noformat} Found one Java-level deadlock:
=
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
  waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
  which is held by IPC Server handler 2 on 62212
IPC Server handler 2 on 62212:
  waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
  which is held by 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953 
{noformat}

Check attached jstack for complete stack


 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinay
Assignee: Vinay
Priority: Blocker

 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5368:


Attachment: NN-deadlock.zip

 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: NN-deadlock.zip


 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796612#comment-13796612
 ] 

Vinay commented on HDFS-5368:
-

HDFS-3486 was fixed in Branch-1. 

 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: NN-deadlock.zip


 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5368:


Attachment: HDFS-5368.patch

Attaching a patch which moves the {{namenode.isInSafeMode()}} call out of the 
{{datanodeMap}} synchronization.
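A rough sketch of the idea (simplified; the field and method names below stand 
in for the real FSNamesystem code):
{code:java}
// Simplified illustration only -- not the actual patch.
// Before, isInSafeMode() was evaluated inside synchronized(datanodeMap), so an
// IPC handler could block on the SafeModeInfo monitor while holding datanodeMap,
// the opposite lock order to the SafeModeMonitor thread.
void heartbeatCheck(DatanodeRegistration nodeReg) {   // hypothetical signature
  boolean inSafeMode = namenode.isInSafeMode();       // taken before the map lock
  synchronized (datanodeMap) {
    handleHeartbeat(nodeReg, inSafeMode);             // use the pre-computed flag
  }
}
{code}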

 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5368.patch, NN-deadlock.zip


 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5368:


Status: Patch Available  (was: Open)

 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5368.patch, NN-deadlock.zip


 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5368:


Affects Version/s: 2.2.0
   3.0.0

 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0, 2.2.0
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5368.patch, NN-deadlock.zip


 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5283) NN not coming out of startup safemode due to under construction blocks only inside snapshots also counted in safemode threshhold

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796661#comment-13796661
 ] 

Hadoop QA commented on HDFS-5283:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608680/HDFS-5283.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
  
org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
  org.apache.hadoop.hdfs.TestDecommission

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5208//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5208//console

This message is automatically generated.

 NN not coming out of startup safemode due to under construction blocks only 
 inside snapshots also counted in safemode threshhold
 

 Key: HDFS-5283
 URL: https://issues.apache.org/jira/browse/HDFS-5283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5283.000.patch, HDFS-5283.patch, HDFS-5283.patch, 
 HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch


 This was observed in one of our environments:
 1. An MR job was running which had created some temporary files and was 
 writing to them.
 2. A snapshot was taken.
 3. The job was killed and the temporary files were deleted.
 4. The Namenode was restarted.
 5. After the restart the Namenode stayed in safemode waiting for blocks.
 Analysis
 -
 1. The snapshot taken also includes the temporary files which were open, and 
 the original files were later deleted.
 2. The under-construction (UC) block count was taken from leases; UC blocks 
 that exist only inside snapshots were not considered.
 3. So the safemode threshold count was too high and the NN did not come out of 
 safemode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5368) Namenode deadlock during safemode extention

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796710#comment-13796710
 ] 

Hadoop QA commented on HDFS-5368:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608692/HDFS-5368.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5209//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5209//console

This message is automatically generated.

 Namenode deadlock during safemode extention
 ---

 Key: HDFS-5368
 URL: https://issues.apache.org/jira/browse/HDFS-5368
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0, 2.2.0
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5368.patch, NN-deadlock.zip


 Namenode entered safemode during restart.
 1. After the restart the NN entered the safemode extension period.
 2. During this time a deadlock occurred between the datanode heartbeat handling 
 and the SafeModeMonitor thread.
 Found one Java-level deadlock:
 =
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953:
   waiting to lock monitor 0x18c3b42c (object 0x0439c6f8, a java.util.TreeMap),
   which is held by IPC Server handler 2 on 62212
 IPC Server handler 2 on 62212:
   waiting to lock monitor 0x18c3987c (object 0x043849a0, a 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo),
   which is held by 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeMonitor@9fe953
 Check attached jstack for complete stack



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5283) NN not coming out of startup safemode due to under construction blocks only inside snapshots also counted in safemode threshhold

2013-10-16 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5283:


Attachment: HDFS-5283.patch

Updated the patch.
{{assert hasReadLock();}} is replaced with {{assert hasReadOrWriteLock();}}

Since {{isInSnapshot()}} is called while holding the write lock, 
{{hasReadLock()}} returns false and the assertion fails.
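For reference, a minimal sketch of the change being described (simplified, not 
the exact patch):
{code:java}
// Old: assert hasReadLock();   // fails when the caller holds only the write lock
// New: accept either lock, since isInSnapshot() is also reached under the write lock.
boolean isInSnapshot(long blockCollectionId) {   // simplified signature
  assert hasReadOrWriteLock();
  // ... look the file up in the snapshot data ...
  return false;
}
{code}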

 NN not coming out of startup safemode due to under construction blocks only 
 inside snapshots also counted in safemode threshhold
 

 Key: HDFS-5283
 URL: https://issues.apache.org/jira/browse/HDFS-5283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5283.000.patch, HDFS-5283.patch, HDFS-5283.patch, 
 HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch


 This was observed in one of our environments:
 1. An MR job was running which had created some temporary files and was 
 writing to them.
 2. A snapshot was taken.
 3. The job was killed and the temporary files were deleted.
 4. The Namenode was restarted.
 5. After the restart the Namenode stayed in safemode waiting for blocks.
 Analysis
 -
 1. The snapshot taken also includes the temporary files which were open, and 
 the original files were later deleted.
 2. The under-construction (UC) block count was taken from leases; UC blocks 
 that exist only inside snapshots were not considered.
 3. So the safemode threshold count was too high and the NN did not come out of 
 safemode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Replication queues should not be initialized in the middle of IBR processing.

2013-10-16 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated HDFS-5346:
---

Status: Open  (was: Patch Available)

 Replication queues should not be initialized in the middle of IBR processing.
 -

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.patch, 
 HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5283) NN not coming out of startup safemode due to under construction blocks only inside snapshots also counted in safemode threshhold

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796863#comment-13796863
 ] 

Hadoop QA commented on HDFS-5283:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608707/HDFS-5283.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5210//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5210//console

This message is automatically generated.

 NN not coming out of startup safemode due to under construction blocks only 
 inside snapshots also counted in safemode threshhold
 

 Key: HDFS-5283
 URL: https://issues.apache.org/jira/browse/HDFS-5283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5283.000.patch, HDFS-5283.patch, HDFS-5283.patch, 
 HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch


 This was observed in one of our environments:
 1. An MR job was running which had created some temporary files and was 
 writing to them.
 2. A snapshot was taken.
 3. The job was killed and the temporary files were deleted.
 4. The Namenode was restarted.
 5. After the restart the Namenode stayed in safemode waiting for blocks.
 Analysis
 -
 1. The snapshot taken also includes the temporary files which were open, and 
 the original files were later deleted.
 2. The under-construction (UC) block count was taken from leases; UC blocks 
 that exist only inside snapshots were not considered.
 3. So the safemode threshold count was too high and the NN did not come out of 
 safemode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Replication queues should not be initialized in the middle of IBR processing.

2013-10-16 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated HDFS-5346:
---

Attachment: HDFS-5346.patch

Attaching the same patch for trunk, even though the branch-23 patch applies to 
trunk with some fuzz. 

 Replication queues should not be initialized in the middle of IBR processing.
 -

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Replication queues should not be initialized in the middle of IBR processing.

2013-10-16 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated HDFS-5346:
---

Status: Patch Available  (was: Open)

 Replication queues should not be initialized in the middle of IBR processing.
 -

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Replication queues should not be initialized in the middle of IBR processing.

2013-10-16 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated HDFS-5346:
---

Attachment: HDFS-5346.branch-23.patch

Hmm. We realized we can set dfs.namenode.replqueue.threshold-pct to 1.0 or 
even 1.5 to make sure that the replication queues are initialized only when the 
NN enters the safemode extension period. So I am truncating the patch to include 
only the optimization that keeps the condition check from traversing the TreeMap.
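For anyone wanting to try that workaround, the threshold can be raised like this 
(the property name is the existing dfs.namenode.replqueue.threshold-pct key; the 
exact value to pick is cluster-dependent):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Raising the replication-queue threshold above the safemode block threshold
// means the queues are only initialized once the NN reaches the safemode
// extension period. The same value can be set in hdfs-site.xml.
Configuration conf = new Configuration();
conf.setFloat("dfs.namenode.replqueue.threshold-pct", 1.0f);
{code}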

 Replication queues should not be initialized in the middle of IBR processing.
 -

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796979#comment-13796979
 ] 

Chris Nauroth commented on HDFS-5096:
-

Agreed with Andrew that we're getting close.  Almost all of my prior feedback 
has been addressed.  I found a few more small things after reviewing test code. 
 Here is the full list of remaining feedback (some of it redundant, but this 
way you don't have to look at multiple old comments).

hdfs-default.xml: Let's document 
{{dfs.namenode.path.based.cache.refresh.interval.ms}}.

{{IntrusiveCollection#addFirst}}: This method appears to be only called from 
test code.  Do you want to keep it, or is it better to delete it?

{{TestPathBasedCacheRequests#waitForCachedBlocks}}: This is another spot where 
I think we should use {{GenericTestUtils#waitFor}}.  Even though the 
JUnit-level timeouts would abort, this tends to leave the process hanging 
around.  {{GenericTestUtils#waitFor}} would throw and exit more cleanly.
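For illustration, the suggested pattern looks roughly like this (the predicate 
body and helper are made up; only the {{GenericTestUtils#waitFor}} call shape is 
the point):
{code:java}
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

// Poll until the expected number of blocks is cached; on timeout this throws,
// failing the test cleanly instead of leaving a hung JVM behind.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return getNumCachedBlocks() >= expectedBlocks;   // hypothetical test helper
  }
}, 500, 60000);   // check every 500 ms, give up after 60 seconds
{code}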


 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.005.patch, 
 HDFS-5096-caching.006.patch, HDFS-5096-caching.009.patch, 
 HDFS-5096-caching.010.patch, HDFS-5096-caching.011.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (HDFS-5203) Concurrent clients that add a cache directive on the same path may prematurely uncache from each other.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth reassigned HDFS-5203:
---

Assignee: Chris Nauroth

 Concurrent clients that add a cache directive on the same path may 
 prematurely uncache from each other.
 ---

 Key: HDFS-5203
 URL: https://issues.apache.org/jira/browse/HDFS-5203
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth

 When a client adds a cache directive, we assign it a unique ID and return 
 that ID to the client.  If multiple clients add a cache directive for the 
 same path, then we return the same ID.  If one client then removes the cache 
 entry for that ID, then it is removed for all clients.  Then, when this 
 change becomes visible in subsequent cache reports, the datanodes may 
 {{munlock}} the block before the other clients are done with it.
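 Put concretely (the calls below are illustrative placeholders, not the exact 
 client API):
{code:java}
// Illustration of the problem described above -- method names are placeholders.
long idA = dfs.addPathBasedCacheDirective("/warehouse/part-0", "pool1");  // client A gets id 7
long idB = dfs.addPathBasedCacheDirective("/warehouse/part-0", "pool1");  // client B also gets id 7
dfs.removePathBasedCacheDirective(idA);   // A is done, but the shared entry disappears,
                                          // so B's blocks may be munlock'ed prematurely
{code}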



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5369) Support negative caching of user-group mapping

2013-10-16 Thread Andrew Wang (JIRA)
Andrew Wang created HDFS-5369:
-

 Summary: Support negative caching of user-group mapping
 Key: HDFS-5369
 URL: https://issues.apache.org/jira/browse/HDFS-5369
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.2.0
Reporter: Andrew Wang


We've seen a situation at a couple of our customers where interactions from an 
unknown user lead to a high rate of group mapping calls. In one case, this was 
happening at a rate of 450 calls per second with the shell-based group mapping, 
enough to severely impact overall namenode performance and also to produce 
large amounts of log spam (a stack trace is printed each time).

Let's consider negative caching of group mapping, as well as quashing the rate 
of this log message.
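A minimal sketch of what negative caching could look like (Guava cache, made-up 
helper names; not a proposal for the final implementation):
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class NegativeGroupCacheSketch {
  // Remember users whose lookup recently returned no groups.
  private final Cache<String, Boolean> negativeCache =
      CacheBuilder.newBuilder().expireAfterWrite(30, TimeUnit.SECONDS).build();

  public List<String> getGroups(String user) {
    if (negativeCache.getIfPresent(user) != null) {
      return Collections.emptyList();            // known-bad user: skip the expensive call
    }
    List<String> groups = doExpensiveLookup(user);   // stands in for the shell/LDAP mapping
    if (groups.isEmpty()) {
      negativeCache.put(user, Boolean.TRUE);         // cache the miss, quashing the call rate
    }
    return groups;
  }

  private List<String> doExpensiveLookup(String user) {
    return Collections.emptyList();                  // placeholder
  }
}
{code}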



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5203) Concurrent clients that add a cache directive on the same path may prematurely uncache from each other.

2013-10-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13796994#comment-13796994
 ] 

Chris Nauroth commented on HDFS-5203:
-

Now that the big changes in HDFS-5096 are winding down, I'm planning on 
revisiting HDFS-5203 soon and preparing a patch.

 Concurrent clients that add a cache directive on the same path may 
 prematurely uncache from each other.
 ---

 Key: HDFS-5203
 URL: https://issues.apache.org/jira/browse/HDFS-5203
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth

 When a client adds a cache directive, we assign it a unique ID and return 
 that ID to the client.  If multiple clients add a cache directive for the 
 same path, then we return the same ID.  If one client then removes the cache 
 entry for that ID, then it is removed for all clients.  Then, when this 
 change becomes visible in subsequent cache reports, the datanodes may 
 {{munlock}} the block before the other clients are done with it.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5369) Support negative caching of user-group mapping

2013-10-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797004#comment-13797004
 ] 

Andrew Wang commented on HDFS-5369:
---

I saw some discussion about negative-caching in HADOOP-8088, where the 
conclusion was that other services on the NN host perform caching, preventing 
expensive RTTs to do an LDAP lookup. However, ~450 shell calls per second is 
expensive even if the result is cached, and even with JNI it still seems like 
unnecessary overhead.

 Support negative caching of user-group mapping
 --

 Key: HDFS-5369
 URL: https://issues.apache.org/jira/browse/HDFS-5369
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.2.0
Reporter: Andrew Wang

 We've seen a situation at a couple of our customers where interactions from 
 an unknown user lead to a high rate of group mapping calls. In one case, 
 this was happening at a rate of 450 calls per second with the shell-based 
 group mapping, enough to severely impact overall namenode performance and 
 also to produce large amounts of log spam (a stack trace is printed each time).
 Let's consider negative caching of group mapping, as well as quashing the 
 rate of this log message.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (HDFS-5369) Support negative caching of user-group mapping

2013-10-16 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang reassigned HDFS-5369:
-

Assignee: Andrew Wang

 Support negative caching of user-group mapping
 --

 Key: HDFS-5369
 URL: https://issues.apache.org/jira/browse/HDFS-5369
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.2.0
Reporter: Andrew Wang
Assignee: Andrew Wang

 We've seen a situation at a couple of our customers where interactions from 
 an unknown user lead to a high rate of group mapping calls. In one case, 
 this was happening at a rate of 450 calls per second with the shell-based 
 group mapping, enough to severely impact overall namenode performance and 
 also to produce large amounts of log spam (a stack trace is printed each time).
 Let's consider negative caching of group mapping, as well as quashing the 
 rate of this log message.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5346) Replication queues should not be initialized in the middle of IBR processing.

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797023#comment-13797023
 ] 

Hadoop QA commented on HDFS-5346:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608720/HDFS-5346.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5211//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5211//console

This message is automatically generated.

 Replication queues should not be initialized in the middle of IBR processing.
 -

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5363) Create SPENGO-authenticated connection in URLConnectionFactory instead WebHdfsFileSystem

2013-10-16 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-5363:
-

Attachment: HDFS-5363.001.patch

 Create SPENGO-authenticated connection in URLConnectionFactory instead 
 WebHdfsFileSystem
 

 Key: HDFS-5363
 URL: https://issues.apache.org/jira/browse/HDFS-5363
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-5363.000.patch, HDFS-5363.001.patch


 Currently the WebHdfsFileSystem class creates the HTTP connections for URLs that 
 require SPNEGO authentication. This patch moves that logic into 
 URLConnectionFactory, which is the factory class that is supposed to create all 
 HTTP connections of the WebHdfs client.
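 (For context, a SPNEGO-authenticated connection is typically opened through 
 {{AuthenticatedURL}}; a factory-side helper along these lines is what the 
 description refers to. Sketch only, not the actual patch.)
{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;
import org.apache.hadoop.security.authentication.client.AuthenticationException;
import org.apache.hadoop.security.authentication.client.KerberosAuthenticator;

// Opens the SPNEGO-authenticated connection inside the factory, so
// WebHdfsFileSystem no longer has to build it itself.
public HttpURLConnection openSpnegoConnection(URL url, AuthenticatedURL.Token token)
    throws IOException, AuthenticationException {
  return new AuthenticatedURL(new KerberosAuthenticator()).openConnection(url, token);
}
{code}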



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5363) Create SPENGO-authenticated connection in URLConnectionFactory instead WebHdfsFileSystem

2013-10-16 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-5363:
-

Issue Type: Sub-task  (was: Improvement)
Parent: HDFS-5305

 Create SPENGO-authenticated connection in URLConnectionFactory instead 
 WebHdfsFileSystem
 

 Key: HDFS-5363
 URL: https://issues.apache.org/jira/browse/HDFS-5363
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-5363.000.patch, HDFS-5363.001.patch


 Currently the WebHdfsFileSystem class creates the HTTP connections for URLs that 
 require SPNEGO authentication. This patch moves that logic into 
 URLConnectionFactory, which is the factory class that is supposed to create all 
 HTTP connections of the WebHdfs client.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5247) Namenode should close editlog and unlock storage when removing failed storage dir

2013-10-16 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797045#comment-13797045
 ] 

Suresh Srinivas commented on HDFS-5247:
---

This is a rare enough problem that it can be worked around by monitoring the 
available disk space. This part of the code has been quite brittle. Some of the 
changes in this area have resulted in more serious bugs and subsequent bug 
fixes for stabilization. My preference is to leave this as is, since monitoring 
disk space can avoid this issue.

 Namenode should close editlog and unlock storage when removing failed storage 
 dir
 -

 Key: HDFS-5247
 URL: https://issues.apache.org/jira/browse/HDFS-5247
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 1.2.1
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
 Fix For: 1.2.1

 Attachments: HDFS-5247-branch-1.2.patch


 When one of the dfs.name.dir directories failed, the namenode did not close the 
 editlog and unlock the storage:
 java  24764 hadoop   78uW  REG 252,32        0 393219 
 /volume1/nn/dfs/in_use.lock (deleted)
 java  24764 hadoop  107u   REG 252,32  1155072 393229 
 /volume1/nn/dfs/current/edits.new (deleted)
 java  24764 hadoop  119u   REG 252,32        0 393238 
 /volume1/nn/dfs/current/fstime.tmp
 java  24764 hadoop  140u   REG 252,32  1761805 393239 
 /volume1/nn/dfs/current/edits
 If this dir has run out of space, then restoring this storage may fail.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5358) Add replication field to PathBasedCacheDirective

2013-10-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797052#comment-13797052
 ] 

Chris Nauroth commented on HDFS-5358:
-

I have a failure in {{TestOfflineEditsViewer#testStored}} since this patch.  It 
looks like we forgot to commit an updated editsStored binary file.  [~cmccabe] 
or [~andrew.wang], do you still have the correct version locally, and if so, 
would you please commit it?  Thanks!

 Add replication field to PathBasedCacheDirective
 

 Key: HDFS-5358
 URL: https://issues.apache.org/jira/browse/HDFS-5358
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: HDFS-4949

 Attachments: HDFS-5358-caching.001.patch, HDFS-5358-caching.002.patch


 Add a 'replication' field to PathBasedCacheDirective, so that administrators 
 can configure how many cached replicas of a block the cluster should try to 
 maintain.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-5096:
---

Attachment: HDFS-5096-caching.012.patch

 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.005.patch, 
 HDFS-5096-caching.006.patch, HDFS-5096-caching.009.patch, 
 HDFS-5096-caching.010.patch, HDFS-5096-caching.011.patch, 
 HDFS-5096-caching.012.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.
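
To make the intended behaviour concrete, here is a minimal, hypothetical sketch of a periodic rescan loop; it is not the CacheReplicationMonitor code from the attached patches, and the names are illustrative only:

{code}
// Conceptual sketch of the periodic rescan described above (illustrative names,
// not the actual HDFS implementation).
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PathRescanSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  /** Re-scan every cached path on a fixed interval so new or appended files get picked up. */
  public void start(final Set<String> cachedPaths, long refreshIntervalMs) {
    scheduler.scheduleWithFixedDelay(new Runnable() {
      @Override
      public void run() {
        for (String path : cachedPaths) {
          rescan(path);
        }
      }
    }, refreshIntervalMs, refreshIntervalMs, TimeUnit.MILLISECONDS);
  }

  private void rescan(String path) {
    // Placeholder: in HDFS this is where finalized blocks under the path would be
    // compared against the desired cache state and scheduled for caching.
  }
}
{code}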



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5358) Add replication field to PathBasedCacheDirective

2013-10-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797055#comment-13797055
 ] 

Andrew Wang commented on HDFS-5358:
---

Probably the same binary diff issue as last time. I'm +1 if anyone wants to 
just commit the new files; it seems unnecessary to open another JIRA.

 Add replication field to PathBasedCacheDirective
 

 Key: HDFS-5358
 URL: https://issues.apache.org/jira/browse/HDFS-5358
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: HDFS-4949

 Attachments: HDFS-5358-caching.001.patch, HDFS-5358-caching.002.patch


 Add a 'replication' field to PathBasedCacheDirective, so that administrators 
 can configure how many cached replicas of a block the cluster should try to 
 maintain.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created HDFS-5370:


 Summary: Typo in Error Message:  different between range in 
condition and range in error message
 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0
Reporter: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0


In DFSInputStream#getBlockAt, there is an  if statement with a condition 
using = but the error message says .
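
As a generic illustration of this class of bug (the exact operators and message text in DFSInputStream#getBlockAt are not reproduced here, so treat this as a stand-in only):

{code}
// Illustrative stand-in, not the actual DFSInputStream code: a guard whose error
// message describes a different range than the condition actually checks.
public class RangeCheckSketch {
  static void checkOffset(long offset, long fileLength) {
    if (offset < 0 || offset >= fileLength) {
      // The fix for this kind of typo is simply making the message match the check.
      throw new IllegalArgumentException(
          "offset < 0 || offset >= fileLength, offset=" + offset
          + ", fileLength=" + fileLength);
    }
  }
}
{code}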



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated HDFS-5370:
-

Attachment: HDFS-5370.patch

I've attached a patch for this issue.

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0
Reporter: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated HDFS-5370:
-

Status: Patch Available  (was: Open)

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0
Reporter: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5283) NN not coming out of startup safemode due to under construction blocks only inside snapshots also counted in safemode threshhold

2013-10-16 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797078#comment-13797078
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-5283:
--

+1 patch looks good.

 Since isInSnapshot() is being called holding the writeLock, hasReadlock() 
 returning false ...

It is a bug.  Let's fix it separately.  I will file a JIRA.

 NN not coming out of startup safemode due to under construction blocks only 
 inside snapshots also counted in safemode threshhold
 

 Key: HDFS-5283
 URL: https://issues.apache.org/jira/browse/HDFS-5283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Attachments: HDFS-5283.000.patch, HDFS-5283.patch, HDFS-5283.patch, 
 HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch


 This was observed in one of our environments:
 1. An MR job was running which had created some temporary files and was 
 writing to them.
 2. A snapshot was taken.
 3. The job was killed and the temporary files were deleted.
 4. The Namenode was restarted.
 5. After the restart the Namenode stayed in safemode waiting for blocks.
 Analysis
 -
 1. The snapshot also includes the temporary files which were 
 open, and the original files were later deleted.
 2. The under-construction block count was taken from leases, which does not 
 consider UC blocks that exist only inside snapshots.
 3. So the safemode threshold count was too high and the NN did not come out of safemode.
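
A small worked example of why such blocks keep the NN in safe mode (all numbers below are made up for illustration):

{code}
// Illustration only: blocks that exist solely inside a snapshot of a deleted,
// still-open file can never be reported by DataNodes, so counting them toward the
// safe-mode target makes the target unreachable.
public class SafemodeThresholdSketch {
  public static void main(String[] args) {
    long totalBlocks      = 1000;   // includes 5 UC blocks that now live only in a snapshot
    long reportableBlocks = 995;    // blocks DataNodes can actually report
    double thresholdPct   = 0.999;  // dfs.namenode.safemode.threshold-pct

    long blocksNeeded = (long) Math.ceil(totalBlocks * thresholdPct);   // 999
    boolean canLeaveSafemode = reportableBlocks >= blocksNeeded;        // false
    System.out.println("NN can leave safe mode: " + canLeaveSafemode);
  }
}
{code}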



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5360) Improvement of usage message of renameSnapshot and deleteSnapshot

2013-10-16 Thread Shinichi Yamashita (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797076#comment-13797076
 ] 

Shinichi Yamashita commented on HDFS-5360:
--

Thank you for your comment. I agree with you.
The argument information included in USAGE is used, so we should confirm 
whether the number of arguments is right.
And I hadn't noticed the spelling mistake.

 Improvement of usage message of renameSnapshot and deleteSnapshot
 -

 Key: HDFS-5360
 URL: https://issues.apache.org/jira/browse/HDFS-5360
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
Assignee: Shinichi Yamashita
Priority: Minor
 Attachments: HDFS-5360.patch


 When the argument of the hdfs dfs -createSnapshot command is inappropriate, the 
 error is displayed as follows.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -createSnapshot
 -createSnapshot: snapshotDir is missing.
 Usage: hadoop fs [generic options] -createSnapshot snapshotDir 
 [snapshotName]
 {code}
 On the other hand, the -renameSnapshot and -deleteSnapshot commands 
 display the following messages, which are not kind to the user.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -renameSnapshot
 renameSnapshot: args number not 3: 0
 [hadoop@trunk ~]$ hdfs dfs -deleteSnapshot
 deleteSnapshot: args number not 2: 0
 {code}
 This changes -renameSnapshot and -deleteSnapshot to output a message 
 similar to that of -createSnapshot.
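
A hedged sketch of the kind of change being proposed (not the attached patch; the helper name and exact usage text are illustrative):

{code}
// Illustrative sketch: validate the argument count and print a -createSnapshot
// style usage message instead of the bare "args number not 3: 0" text.
public class SnapshotUsageSketch {
  private static final String RENAME_USAGE =
      "-renameSnapshot <snapshotDir> <oldName> <newName>";

  static void checkRenameArgs(java.util.List<String> args) {
    if (args.size() != 3) {
      throw new IllegalArgumentException(
          "Usage: hadoop fs [generic options] " + RENAME_USAGE);
    }
  }
}
{code}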



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5360) Improvement of usage message of renameSnapshot and deleteSnapshot

2013-10-16 Thread Shinichi Yamashita (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shinichi Yamashita updated HDFS-5360:
-

Attachment: HDFS-5360.patch

I attach a revised patch.

 Improvement of usage message of renameSnapshot and deleteSnapshot
 -

 Key: HDFS-5360
 URL: https://issues.apache.org/jira/browse/HDFS-5360
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
Assignee: Shinichi Yamashita
Priority: Minor
 Attachments: HDFS-5360.patch, HDFS-5360.patch


 When the argument of the hdfs dfs -createSnapshot command is inappropriate, the 
 error is displayed as follows.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -createSnapshot
 -createSnapshot: snapshotDir is missing.
 Usage: hadoop fs [generic options] -createSnapshot snapshotDir 
 [snapshotName]
 {code}
 On the other hand, the -renameSnapshot and -deleteSnapshot commands 
 display the following messages, which are not kind to the user.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -renameSnapshot
 renameSnapshot: args number not 3: 0
 [hadoop@trunk ~]$ hdfs dfs -deleteSnapshot
 deleteSnapshot: args number not 2: 0
 {code}
 This changes -renameSnapshot and -deleteSnapshot to output a message 
 similar to that of -createSnapshot.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5360) Improvement of usage message of renameSnapshot and deleteSnapshot

2013-10-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797083#comment-13797083
 ] 

Andrew Wang commented on HDFS-5360:
---

+1 pending Jenkins, thanks for your contribution

 Improvement of usage message of renameSnapshot and deleteSnapshot
 -

 Key: HDFS-5360
 URL: https://issues.apache.org/jira/browse/HDFS-5360
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
Assignee: Shinichi Yamashita
Priority: Minor
 Attachments: HDFS-5360.patch, HDFS-5360.patch


 When the argument of the hdfs dfs -createSnapshot command is inappropriate, the 
 error is displayed as follows.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -createSnapshot
 -createSnapshot: snapshotDir is missing.
 Usage: hadoop fs [generic options] -createSnapshot snapshotDir 
 [snapshotName]
 {code}
 On the other hand, the -renameSnapshot and -deleteSnapshot commands 
 display the following messages, which are not kind to the user.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -renameSnapshot
 renameSnapshot: args number not 3: 0
 [hadoop@trunk ~]$ hdfs dfs -deleteSnapshot
 deleteSnapshot: args number not 2: 0
 {code}
 This changes -renameSnapshot and -deleteSnapshot to output a message 
 similar to that of -createSnapshot.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5371) Let client retry the same NN when dfs.client.test.drop.namenode.response.number is enabled

2013-10-16 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-5371:


Description: Currently when dfs.client.test.drop.namenode.response.number 
is enabled for testing, the client will start failover and try the other NN. 
But in most of the testing cases we do not need to trigger the client failover 
here since if the drop-response number is 1 the next response received from 
the other NN will also be dropped. We can let the client just simply retry the 
same NN.

 Let client retry the same NN when 
 dfs.client.test.drop.namenode.response.number is enabled
 

 Key: HDFS-5371
 URL: https://issues.apache.org/jira/browse/HDFS-5371
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-5371.000.patch


 Currently when dfs.client.test.drop.namenode.response.number is enabled for 
 testing, the client will start failover and try the other NN. But in most of 
 the testing cases we do not need to trigger the client failover here since if 
 the drop-response number is 1 the next response received from the other NN 
 will also be dropped. We can let the client just simply retry the same NN.
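
For context, a minimal sketch of how this test knob is wired up in a test configuration (the retry-policy change itself is not shown; only the key quoted above is used, and the value is illustrative):

{code}
// Sketch only: enable the response-dropping test knob discussed in this issue.
import org.apache.hadoop.conf.Configuration;

public class DropResponseTestConfSketch {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // Drop one NameNode response per call; with this change the client would
    // simply retry the same NN instead of failing over.
    conf.setInt("dfs.client.test.drop.namenode.response.number", 1);
    return conf;
  }
}
{code}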



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5371) Let client retry the same NN when dfs.client.test.drop.namenode.response.number is enabled

2013-10-16 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-5371:


Attachment: HDFS-5371.000.patch

 Let client retry the same NN when 
 dfs.client.test.drop.namenode.response.number is enabled
 

 Key: HDFS-5371
 URL: https://issues.apache.org/jira/browse/HDFS-5371
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-5371.000.patch


 Currently when dfs.client.test.drop.namenode.response.number is enabled for 
 testing, the client will start failover and try the other NN. But in most of 
 the testing cases we do not need to trigger the client failover here since if 
 the drop-response number is 1 the next response received from the other NN 
 will also be dropped. We can let the client just simply retry the same NN.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5371) Let client retry the same NN when dfs.client.test.drop.namenode.response.number is enabled

2013-10-16 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-5371:
---

 Summary: Let client retry the same NN when 
dfs.client.test.drop.namenode.response.number is enabled
 Key: HDFS-5371
 URL: https://issues.apache.org/jira/browse/HDFS-5371
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big

2013-10-16 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797094#comment-13797094
 ] 

Benoy Antony commented on HDFS-5367:


+1.
Solves the problem on our clusters.
Please review and commit.
John, could you please provide a patch for trunk as well?


 Restore fsimage locked NameNode too long when the size of fsimage are big
 -

 Key: HDFS-5367
 URL: https://issues.apache.org/jira/browse/HDFS-5367
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
 Attachments: HDFS-5367-branch-1.2.patch


 Our cluster has a 40G fsimage, and we write one copy of the edit log to NFS.
 After NFS temporarily failed, the NameNode tried to recover it during the next 
 checkpoint and saved the 40G fsimage to NFS. That takes some time (40G / 128MB/s = 
 320 seconds), during which the FSNamesystem was locked, and this brought down our cluster.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5283) NN not coming out of startup safemode due to under construction blocks only inside snapshots also counted in safemode threshhold

2013-10-16 Thread Tsz Wo (Nicholas), SZE (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HDFS-5283:
-

   Resolution: Fixed
Fix Version/s: 2.3.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I have committed this.  Thanks, Vinay!

 NN not coming out of startup safemode due to under construction blocks only 
 inside snapshots also counted in safemode threshhold
 

 Key: HDFS-5283
 URL: https://issues.apache.org/jira/browse/HDFS-5283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Fix For: 2.3.0

 Attachments: HDFS-5283.000.patch, HDFS-5283.patch, HDFS-5283.patch, 
 HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch


 This was observed in one of our environments:
 1. An MR job was running which had created some temporary files and was 
 writing to them.
 2. A snapshot was taken.
 3. The job was killed and the temporary files were deleted.
 4. The Namenode was restarted.
 5. After the restart the Namenode stayed in safemode waiting for blocks.
 Analysis
 -
 1. The snapshot also includes the temporary files which were 
 open, and the original files were later deleted.
 2. The under-construction block count was taken from leases, which does not 
 consider UC blocks that exist only inside snapshots.
 3. So the safemode threshold count was too high and the NN did not come out of safemode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5372) In FSNamesystem, hasReadLock() returns false if the current thread holds the write lock

2013-10-16 Thread Tsz Wo (Nicholas), SZE (JIRA)
Tsz Wo (Nicholas), SZE created HDFS-5372:


 Summary: In FSNamesystem, hasReadLock() returns false if the 
current thread holds the write lock
 Key: HDFS-5372
 URL: https://issues.apache.org/jira/browse/HDFS-5372
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Tsz Wo (Nicholas), SZE
Assignee: Tsz Wo (Nicholas), SZE


This bug was discovered by [~vinayrpet] in [this 
comment|https://issues.apache.org/jira/browse/HDFS-5283?focusedCommentId=13796752page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13796752].



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797122#comment-13797122
 ] 

Colin Patrick McCabe commented on HDFS-5096:


bq. hdfs-default.xml: Let's document 
dfs.namenode.path.based.cache.refresh.interval.ms.

Added.

bq. IntrusiveCollection#addFirst: This method appears to be only called from 
test code. Do you want to keep it, or is it better to delete it?

It's a pretty small function.  I'd like to keep it in case it's needed later.  
Since we have a doubly-linked list, being able to add at the beginning or the 
end is a nice feature.

bq. TestPathBasedCacheRequests#waitForCachedBlocks: This is another spot where 
I think we should use GenericTestUtils#waitFor. Even though the JUnit-level 
timeouts would abort, this tends to leave the process hanging around. 
GenericTestUtils#waitFor would throw and exit more cleanly.

Good idea.
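
A minimal sketch of the GenericTestUtils#waitFor polling idiom being suggested (the stand-in helper below is hypothetical, not the real waitForCachedBlocks):

{code}
// Sketch of the polling idiom; countCachedBlocks() is a stand-in helper.
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

public class WaitForCachedBlocksSketch {
  static int countCachedBlocks() {
    return 0;  // stand-in: the real helper would query the NameNode's cached-block count
  }

  static void waitForCachedBlocks(final int expected) throws Exception {
    GenericTestUtils.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return countCachedBlocks() >= expected;
      }
    }, 500, 60000);  // poll every 500 ms, time out (and fail cleanly) after 60 s
  }
}
{code}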

 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.005.patch, 
 HDFS-5096-caching.006.patch, HDFS-5096-caching.009.patch, 
 HDFS-5096-caching.010.patch, HDFS-5096-caching.011.patch, 
 HDFS-5096-caching.012.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated HDFS-5370:
-

Assignee: Kousuke Saruta

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5358) Add replication field to PathBasedCacheDirective

2013-10-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797130#comment-13797130
 ] 

Chris Nauroth commented on HDFS-5358:
-

Thanks, Andrew.  I just committed a fix.  The problem was that editsStored 
didn't have the replication field on the {{AddPathBasedCacheDirectiveOp}}, so 
it would fail in deserialization.  The editsStored.xml file was already updated 
to include replication though.  The easiest thing to do was to run offline 
edits viewer to convert editsStored.xml to editsStored binary and check that in.

Note however that the test won't pass until HDFS-5096 goes in.  During that 
code review, we found a {{NullPointerException}} in {{setCacheReplication}}.  
It made sense to fix it over there along with all of the refactoring that 
happened.

 Add replication field to PathBasedCacheDirective
 

 Key: HDFS-5358
 URL: https://issues.apache.org/jira/browse/HDFS-5358
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe
 Fix For: HDFS-4949

 Attachments: HDFS-5358-caching.001.patch, HDFS-5358-caching.002.patch


 Add a 'replication' field to PathBasedCacheDirective, so that administrators 
 can configure how many cached replicas of a block the cluster should try to 
 maintain.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated HDFS-5370:
-

Hadoop Flags: Reviewed

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0, 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated HDFS-5370:
-

Affects Version/s: 2.2.0

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0, 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated HDFS-5370:
-

Fix Version/s: 2.2.1

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0, 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797136#comment-13797136
 ] 

Tsuyoshi OZAWA commented on HDFS-5370:
--

+1, LGTM. Pending Jenkins.

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0, 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (HDFS-5313) NameNode hangs during startup trying to apply OP_ADD_PATH_BASED_CACHE_DIRECTIVE.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-5313.
-

Resolution: Duplicate
  Assignee: Chris Nauroth

I've confirmed that HDFS-5096 fixes this bug.

 NameNode hangs during startup trying to apply 
 OP_ADD_PATH_BASED_CACHE_DIRECTIVE.
 

 Key: HDFS-5313
 URL: https://issues.apache.org/jira/browse/HDFS-5313
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth

 During namenode startup, if the edits contain a 
 {{OP_ADD_PATH_BASED_CACHE_DIRECTIVE}} for an existing file, then the process 
 hangs while trying to apply the op.  This is because of a call to 
 {{FSDirectory#setCacheReplication}}, which calls 
 {{FSDirectory#waitForReady}}, but of course nothing is ever going to mark the 
 directory ready, because it's still in the process of loading.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797144#comment-13797144
 ] 

Chris Nauroth commented on HDFS-5096:
-

+1 for the patch, pending resolution of feedback from Andrew too.  Thanks very 
much, Colin!

I've had a chance to take this patch for a manual test run too in a 
pseudo-distributed deployment.  I created some files in a directory, and then 
applied a cache directive on that directory.  All of the existing files got 
cached relatively quickly due to {{CacheReplicationMonitor#kick}}.  Next, I 
added some new files in the same directory.  After the 
{{dfs.namenode.path.based.cache.refresh.interval.ms}} elapsed, 
{{CacheReplicationMonitor}} scanned again and cached the new files.  I ran pmap 
to confirm that the block files were memory-mapped into the datanode process.  
I also put my namenode through a restart to confirm that we had fixed the 
hanging problem I reported in HDFS-5313.  I'll close that issue now.

It all looks good!


 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.005.patch, 
 HDFS-5096-caching.006.patch, HDFS-5096-caching.009.patch, 
 HDFS-5096-caching.010.patch, HDFS-5096-caching.011.patch, 
 HDFS-5096-caching.012.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5283) NN not coming out of startup safemode due to under construction blocks only inside snapshots also counted in safemode threshhold

2013-10-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797164#comment-13797164
 ] 

Hudson commented on HDFS-5283:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4612 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4612/])
Add TestOpenFilesWithSnapshot.java for HDFS-5283. (szetszwo: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1532860)
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestOpenFilesWithSnapshot.java
HDFS-5283. Under construction blocks only inside snapshots should not be 
counted in safemode threshhold.  Contributed by Vinay (szetszwo: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1532857)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/Namesystem.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java


 NN not coming out of startup safemode due to under construction blocks only 
 inside snapshots also counted in safemode threshhold
 

 Key: HDFS-5283
 URL: https://issues.apache.org/jira/browse/HDFS-5283
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Vinay
Assignee: Vinay
Priority: Blocker
 Fix For: 2.3.0

 Attachments: HDFS-5283.000.patch, HDFS-5283.patch, HDFS-5283.patch, 
 HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch, HDFS-5283.patch


 This was observed in one of our environments:
 1. An MR job was running which had created some temporary files and was 
 writing to them.
 2. A snapshot was taken.
 3. The job was killed and the temporary files were deleted.
 4. The Namenode was restarted.
 5. After the restart the Namenode stayed in safemode waiting for blocks.
 Analysis
 -
 1. The snapshot also includes the temporary files which were 
 open, and the original files were later deleted.
 2. The under-construction block count was taken from leases, which does not 
 consider UC blocks that exist only inside snapshots.
 3. So the safemode threshold count was too high and the NN did not come out of safemode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5373) hdfs cacheadmin -addDirective short usage does not mention -replication parameter.

2013-10-16 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-5373:
---

 Summary: hdfs cacheadmin -addDirective short usage does not 
mention -replication parameter.
 Key: HDFS-5373
 URL: https://issues.apache.org/jira/browse/HDFS-5373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth


The short description of hdfs cacheadmin -addDirective does not mention that 
you can set the -replication parameter.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5373) hdfs cacheadmin -addDirective short usage does not mention -replication parameter.

2013-10-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797184#comment-13797184
 ] 

Chris Nauroth commented on HDFS-5373:
-

The long usage does mention the -replication parameter.  The problem is limited 
to just the short usage.  This was probably just a minor oversight from 
HDFS-5358.

 hdfs cacheadmin -addDirective short usage does not mention -replication 
 parameter.
 --

 Key: HDFS-5373
 URL: https://issues.apache.org/jira/browse/HDFS-5373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth

 The short description of hdfs cacheadmin -addDirective does not mention that 
 you can set the -replication parameter.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5373) hdfs cacheadmin -addDirective short usage does not mention -replication parameter.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-5373:


Priority: Trivial  (was: Major)

 hdfs cacheadmin -addDirective short usage does not mention -replication 
 parameter.
 --

 Key: HDFS-5373
 URL: https://issues.apache.org/jira/browse/HDFS-5373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial

 The short description of hdfs cacheadmin -addDirective does not mention that 
 you can set the -replication parameter.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5363) Create SPENGO-authenticated connection in URLConnectionFactory instead WebHdfsFileSystem

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797188#comment-13797188
 ] 

Hadoop QA commented on HDFS-5363:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608755/HDFS-5363.001.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestHftpURLTimeouts

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5212//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5212//console

This message is automatically generated.

 Create SPENGO-authenticated connection in URLConnectionFactory instead 
 WebHdfsFileSystem
 

 Key: HDFS-5363
 URL: https://issues.apache.org/jira/browse/HDFS-5363
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-5363.000.patch, HDFS-5363.001.patch


 Currently the WebHdfsFileSystem class creates the HTTP connections for URLs that 
 require SPNEGO authentication. This patch moves that logic to 
 URLConnectionFactory, which is the factory class that is supposed to create all 
 HTTP connections of the WebHdfs client.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Work started] (HDFS-5373) hdfs cacheadmin -addDirective short usage does not mention -replication parameter.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-5373 started by Chris Nauroth.

 hdfs cacheadmin -addDirective short usage does not mention -replication 
 parameter.
 --

 Key: HDFS-5373
 URL: https://issues.apache.org/jira/browse/HDFS-5373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial
 Attachments: HDFS-5373.1.patch


 The short description of hdfs cacheadmin -addDirective does not mention that 
 you can set the -replication parameter.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5373) hdfs cacheadmin -addDirective short usage does not mention -replication parameter.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-5373:


Attachment: HDFS-5373.1.patch

Here is a trivial patch to update the short usage string.  I also updated 
testCacheAdminConf.xml so that it tries passing -replication.  [~andrew.wang] 
or [~cmccabe], does this look good?

 hdfs cacheadmin -addDirective short usage does not mention -replication 
 parameter.
 --

 Key: HDFS-5373
 URL: https://issues.apache.org/jira/browse/HDFS-5373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial
 Attachments: HDFS-5373.1.patch


 The short description of hdfs cacheadmin -addDirective does not mention that 
 you can set the -replication parameter.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5373) hdfs cacheadmin -addDirective short usage does not mention -replication parameter.

2013-10-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797200#comment-13797200
 ] 

Andrew Wang commented on HDFS-5373:
---

+1 thanks Chris

 hdfs cacheadmin -addDirective short usage does not mention -replication 
 parameter.
 --

 Key: HDFS-5373
 URL: https://issues.apache.org/jira/browse/HDFS-5373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial
 Attachments: HDFS-5373.1.patch


 The short description of hdfs cacheadmin -addDirective does not mention that 
 you can set the -replication parameter.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797215#comment-13797215
 ] 

Hadoop QA commented on HDFS-5370:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608762/HDFS-5370.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5213//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5213//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5213//console

This message is automatically generated.

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0, 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (HDFS-5373) hdfs cacheadmin -addDirective short usage does not mention -replication parameter.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-5373.
-

   Resolution: Fixed
Fix Version/s: HDFS-4949
 Hadoop Flags: Reviewed

Thanks, Andrew.  I've committed this to the HDFS-4949 branch.

 hdfs cacheadmin -addDirective short usage does not mention -replication 
 parameter.
 --

 Key: HDFS-5373
 URL: https://issues.apache.org/jira/browse/HDFS-5373
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: HDFS-4949
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Trivial
 Fix For: HDFS-4949

 Attachments: HDFS-5373.1.patch


 The short description of hdfs cacheadmin -addDirective does not mention that 
 you can set the -replication parameter.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797228#comment-13797228
 ] 

Suresh Srinivas commented on HDFS-5370:
---

+1 for the change. I do not think the Jenkins -1 is related to this 
straightforward patch.

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 3.0.0, 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated HDFS-5370:
--

Priority: Trivial  (was: Minor)

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Trivial
 Fix For: 3.0.0, 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-5096:
---

Attachment: HDFS-5096-caching.002.patch

Thanks, Chris.

Minor fixup here: the rescan thread now removes CacheBlock objects from the 
pending uncached list for a DN if the blocks are no longer cached on that DN.

 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.002.patch, 
 HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch, 
 HDFS-5096-caching.009.patch, HDFS-5096-caching.010.patch, 
 HDFS-5096-caching.011.patch, HDFS-5096-caching.012.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797246#comment-13797246
 ] 

Hudson commented on HDFS-5370:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4616 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4616/])
HDFS-5370. Typo in Error Message: different between range in condition and 
range in error message. Contributed by Kousuke Saruta. (suresh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1532899)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java


 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Trivial
 Fix For: 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5370) Typo in Error Message: different between range in condition and range in error message

2013-10-16 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated HDFS-5370:
--

   Resolution: Fixed
Fix Version/s: (was: 3.0.0)
   Status: Resolved  (was: Patch Available)

I have committed the patch to branch-2.2 and other branches leading up to it. 
Thank you Kousuke Saruta.

 Typo in Error Message:  different between range in condition and range in 
 error message
 ---

 Key: HDFS-5370
 URL: https://issues.apache.org/jira/browse/HDFS-5370
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0, 2.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Trivial
 Fix For: 2.2.1

 Attachments: HDFS-5370.patch


 In DFSInputStream#getBlockAt, there is an  if statement with a condition 
 using = but the error message says .



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Avoid calling

2013-10-16 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-5346:
-

Summary: Avoid calling   (was: Replication queues should not be initialized 
in the middle of IBR processing.)

 Avoid calling 
 --

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR processing

2013-10-16 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-5346:
-

Summary: Avoid unnecessary call to getNumLiveDataNodes() for each block 
during IBR processing  (was: Avoid calling )

 Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR 
 processing
 

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5346) Replication queues should not be initialized in the middle of IBR processing.

2013-10-16 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797248#comment-13797248
 ] 

Kihwal Lee commented on HDFS-5346:
--

bq. We realized we can set dfs.namenode.replqueue.threshold-pct to 1.0 or even 
1.5 to make sure that only when the NN enters the Safemode extension period are 
the replication queues initialized.

Thanks for the analysis, Ravi. As you said, setting this config to 1.0 or higher 
will prevent the replication queues from being initialized in the middle of 
block report processing.  Since the main loop of SafeModeMonitor in 
trunk/branch-2 and leaveSafeMode() called by SafeModeMonitor in branch-0.23 
acquire the FSN lock, nothing will get in between replication queue 
initialization and leaving safe mode to cause delays. 

+1 The patch looks good.  I will change the title of this jira to reflect the 
actual change.
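
A minimal sketch of the configuration workaround quoted above (the 1.5 value is illustrative):

{code}
// Sketch only: raising the repl-queue threshold above the safe-mode threshold keeps
// replication queues from being initialized while initial block reports are still
// being processed.
import org.apache.hadoop.conf.Configuration;

public class ReplQueueThresholdSketch {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.setFloat("dfs.namenode.replqueue.threshold-pct", 1.5f);
    return conf;
  }
}
{code}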

 Replication queues should not be initialized in the middle of IBR processing.
 -

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR processing

2013-10-16 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-5346:
-

Description: 
When initial block reports are being processed, checkMode() is called from 
incrementSafeBlockCount(). This causes the replication queues to be initialized 
in the middle of processing a block report in the IBR processing mode. If there 
are many block reports waiting to be processed, SafeModeMonitor won't be able 
to make name node leave the safe mode soon. It appears that the block report 
processing speed degrades considerably during this time. 

Update: The main issue can be resolved by config. The other issue of calling 

  was:When initial block reports are being processed, checkMode() is called 
from incrementSafeBlockCount(). This causes the replication queues to be 
initialized in the middle of processing a block report in the IBR processing 
mode. If there are many block reports waiting to be processed, SafeModeMonitor 
won't be able to make name node leave the safe mode soon. It appears that the 
block report processing speed degrades considerably during this time. 


 Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR 
 processing
 

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 
 Update: The main issue can be resolved by config. The other issue of calling 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR processing

2013-10-16 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-5346:
-

Description: 
When initial block reports are being processed, checkMode() is called from 
incrementSafeBlockCount(). This causes the replication queues to be initialized 
in the middle of processing a block report in the IBR processing mode. If there 
are many block reports waiting to be processed, SafeModeMonitor won't be able 
to make name node leave the safe mode soon. It appears that the block report 
processing speed degrades considerably during this time. 

Update: The main issue can be resolved by config. The other issue of calling 
getNumLiveDataNodes() for each block in the block report will be addressed in 
this jira

  was:
When initial block reports are being processed, checkMode() is called from 
incrementSafeBlockCount(). This causes the replication queues to be initialized 
in the middle of processing a block report in the IBR processing mode. If there 
are many block reports waiting to be processed, SafeModeMonitor won't be able 
to make name node leave the safe mode soon. It appears that the block report 
processing speed degrades considerably during this time. 

Update: The main issue can be resolved by config. The other issue of calling 


 Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR 
 processing
 

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 
 Update: The main issue can be resolved by config. The other issue of calling 
 getNumLiveDataNodes() for each block in the block report will be addressed in 
 this jira



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5346) Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR processing

2013-10-16 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-5346:
-

   Resolution: Fixed
Fix Version/s: 3.0.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I've committed this to branch-0.23, branch-2 and trunk. Thanks for working on 
the fix, Ravi.

 Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR 
 processing
 

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 3.0.0, 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 
 Update: The main issue can be resolved by config. The other issue of calling 
 getNumLiveDataNodes() for each block in the block report will be addressed in 
 this jira



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797270#comment-13797270
 ] 

Andrew Wang commented on HDFS-5096:
---

I'm +1 pending some nitty things. It's mostly just rolling through my previous 
comments. Great work Colin!

CachedBlock
* Add a little more to the {{CachedBlock#triplets}} javadoc that specifies 
{{element, prev, next}}. You could even just copy the javadoc from 
{{BlockInfo}}.
* getDatanodes javadoc should mention pending uncached blocks too
* Class javadoc explaining the use of the GSet and IntrusiveCollection
* A short is 16 bits, the comment on {{replicationAndMark}} indicates it's 8 
bits.

CacheReplicationMonitor
* I think this should be a {{>=}} (a short sketch follows at the end of this 
comment):
{code}
  if (numCached > neededCached) {
{code}

Follow-on work (some might just be part of HDFS-5366):
* Refactor out a separate {{CacheReplicationPolicy}} class with more smarts (is 
this HDFS-5366?)
* Take into account DN decommissioning status when doing caching/uncaching; this 
should be easy to fix as part of HDFS-5366
* Incremental kicking of the CRMon on PBCE changes
* Kicking on a DN failure
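
A minimal sketch of the comparison issue flagged above (the helper is 
hypothetical; only the {{numCached}}/{{neededCached}} names come from the quoted 
code):
{code}
// Hypothetical helper illustrating the review point: with ">" the monitor would
// still schedule caching work when the target is exactly met, so ">=" is needed.
static boolean needsMoreCaching(int numCached, int neededCached) {
  if (numCached >= neededCached) {
    return false;  // target met or exceeded: nothing more to schedule
  }
  return true;
}
{code}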

 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.002.patch, 
 HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch, 
 HDFS-5096-caching.009.patch, HDFS-5096-caching.010.patch, 
 HDFS-5096-caching.011.patch, HDFS-5096-caching.012.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5374) Remove deadcode in DFSOutputStream

2013-10-16 Thread Suresh Srinivas (JIRA)
Suresh Srinivas created HDFS-5374:
-

 Summary: Remove deadcode in DFSOutputStream
 Key: HDFS-5374
 URL: https://issues.apache.org/jira/browse/HDFS-5374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Suresh Srinivas
Priority: Trivial
 Attachments: HDFS-4374.patch

Deadcode:
{code}
  if (one.isHeartbeatPacket()) {  //heartbeat packet
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (HDFS-5374) Remove deadcode in DFSOutputStream

2013-10-16 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas reassigned HDFS-5374:
-

Assignee: Suresh Srinivas

 Remove deadcode in DFSOutputStream
 --

 Key: HDFS-5374
 URL: https://issues.apache.org/jira/browse/HDFS-5374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Suresh Srinivas
Assignee: Suresh Srinivas
Priority: Trivial
 Attachments: HDFS-4374.patch


 Deadcode:
 {code}
   if (one.isHeartbeatPacket()) {  //heartbeat packet
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5346) Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR processing

2013-10-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797275#comment-13797275
 ] 

Hudson commented on HDFS-5346:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4618 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4618/])
HDFS-5346. Avoid unnecessary call to getNumLiveDataNodes() for each block 
during IBR processing. Contributed by Ravi Prakash. (kihwal: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1532915)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java


 Avoid unnecessary call to getNumLiveDataNodes() for each block during IBR 
 processing
 

 Key: HDFS-5346
 URL: https://issues.apache.org/jira/browse/HDFS-5346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, performance
Affects Versions: 0.23.9, 2.3.0
Reporter: Kihwal Lee
Assignee: Ravi Prakash
 Fix For: 3.0.0, 2.3.0, 0.23.10

 Attachments: HDFS-5346.branch-23.patch, HDFS-5346.branch-23.patch, 
 HDFS-5346.patch, HDFS-5346.patch, HDFS-5346.patch


 When initial block reports are being processed, checkMode() is called from 
 incrementSafeBlockCount(). This causes the replication queues to be 
 initialized in the middle of processing a block report in the IBR processing 
 mode. If there are many block reports waiting to be processed, 
 SafeModeMonitor won't be able to make name node leave the safe mode soon. It 
 appears that the block report processing speed degrades considerably during 
 this time. 
 Update: The main issue can be resolved by config. The other issue of calling 
 getNumLiveDataNodes() for each block in the block report will be addressed in 
 this jira



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5374) Remove deadcode in DFSOutputStream

2013-10-16 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated HDFS-5374:
--

Status: Patch Available  (was: Open)

 Remove deadcode in DFSOutputStream
 --

 Key: HDFS-5374
 URL: https://issues.apache.org/jira/browse/HDFS-5374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Suresh Srinivas
Assignee: Suresh Srinivas
Priority: Trivial
 Attachments: HDFS-4374.patch


 Deadcode:
 {code}
   if (one.isHeartbeatPacket()) {  //heartbeat packet
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5374) Remove deadcode in DFSOutputStream

2013-10-16 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated HDFS-5374:
--

Attachment: HDFS-4374.patch

Removed the dead code. I also fixed some typos and java warnings.

 Remove deadcode in DFSOutputStream
 --

 Key: HDFS-5374
 URL: https://issues.apache.org/jira/browse/HDFS-5374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Suresh Srinivas
Assignee: Suresh Srinivas
Priority: Trivial
 Attachments: HDFS-4374.patch


 Deadcode:
 {code}
   if (one.isHeartbeatPacket()) {  //heartbeat packet
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5371) Let client retry the same NN when dfs.client.test.drop.namenode.response.number is enabled

2013-10-16 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-5371:


Status: Patch Available  (was: Open)

 Let client retry the same NN when 
 dfs.client.test.drop.namenode.response.number is enabled
 

 Key: HDFS-5371
 URL: https://issues.apache.org/jira/browse/HDFS-5371
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-5371.000.patch


 Currently when dfs.client.test.drop.namenode.response.number is enabled for 
 testing, the client will start failover and try the other NN. But in most of 
 the testing cases we do not need to trigger the client failover here since if 
 the drop-response number is > 1, the next response received from the other NN 
 will also be dropped. We can let the client just simply retry the same NN.
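
A hypothetical sketch of how this test knob might be wired up in a unit test (the 
configuration key is quoted from the description above; using it through a plain 
Configuration object is an assumption):
{code}
import org.apache.hadoop.conf.Configuration;

public class DropResponseTestSetup {
  // Hypothetical helper: drop the first two NameNode responses seen by the
  // client; with this change the client retries the same NN rather than
  // failing over to the other one.
  public static Configuration dropTwoResponses() {
    Configuration conf = new Configuration();
    conf.setInt("dfs.client.test.drop.namenode.response.number", 2);
    return conf;
  }
}
{code}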



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5374) Remove deadcode in DFSOutputStream

2013-10-16 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797296#comment-13797296
 ] 

Brandon Li commented on HDFS-5374:
--

+1

 Remove deadcode in DFSOutputStream
 --

 Key: HDFS-5374
 URL: https://issues.apache.org/jira/browse/HDFS-5374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Suresh Srinivas
Assignee: Suresh Srinivas
Priority: Trivial
 Attachments: HDFS-4374.patch


 Deadcode:
 {code}
   if (one.isHeartbeatPacket()) {  //heartbeat packet
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5375) hdfs.cmd does not expose several snapshot commands.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-5375:


Attachment: HDFS-5375.1.patch

Here is a patch to add the commands to the cmd file.  Thanks to [~rramya] for 
finding and reporting the bug.

 hdfs.cmd does not expose several snapshot commands.
 ---

 Key: HDFS-5375
 URL: https://issues.apache.org/jira/browse/HDFS-5375
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 2.2.0
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor
 Attachments: HDFS-5375.1.patch


 We need to update hdfs.cmd to expose the snapshotDiff and lsSnapshottableDir 
 commands on Windows.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5375) hdfs.cmd does not expose several snapshot commands.

2013-10-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-5375:


Status: Patch Available  (was: Open)

 hdfs.cmd does not expose several snapshot commands.
 ---

 Key: HDFS-5375
 URL: https://issues.apache.org/jira/browse/HDFS-5375
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 2.2.0
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor
 Attachments: HDFS-5375.1.patch


 We need to update hdfs.cmd to expose the snapshotDiff and lsSnapshottableDir 
 commands on Windows.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5375) hdfs.cmd does not expose several snapshot commands.

2013-10-16 Thread Chris Nauroth (JIRA)
Chris Nauroth created HDFS-5375:
---

 Summary: hdfs.cmd does not expose several snapshot commands.
 Key: HDFS-5375
 URL: https://issues.apache.org/jira/browse/HDFS-5375
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 2.2.0
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor


We need to update hdfs.cmd to expose the snapshotDiff and lsSnapshottableDir 
commands on Windows.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5366) recaching improvements

2013-10-16 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797319#comment-13797319
 ] 

Andrew Wang commented on HDFS-5366:
---

One interesting idea from the block replication code is having priorities for 
replication work based on the current and expected replication factor. Maybe a 
0 of 3 case should be rescheduled elsewhere more quickly than the 10.5 minute 
dead datanode interval, while we let a mild case of 2 of 3 sit.

I don't think this will require tracking our own list of stale or dead 
nodes, just a list of nodes we've already tried for an outstanding request. We 
reset if we've tried all targets. I seem to remember the block recovery code or 
something doing this. Avoiding stale nodes might also be good enough, if we 
think that heartbeats are a good proxy for the DN's ability to cache/uncache. 
This probably isn't true for uncaching though, since as you've noted, a hung 
client could just hold onto a ZCR lease.
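
A rough sketch of the "remember which targets we already tried" idea (class and 
method names are illustrative, not existing HDFS APIs):
{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical bookkeeping for one outstanding cache/uncache request: remember
// the datanodes already asked, and reset once every candidate has been tried.
class TriedTargets {
  private final Set<String> tried = new HashSet<String>();

  String next(List<String> candidates) {
    if (!candidates.isEmpty() && tried.containsAll(candidates)) {
      tried.clear();  // all targets tried once; allow another round
    }
    for (String dn : candidates) {
      if (tried.add(dn)) {
        return dn;  // first candidate we have not asked yet
      }
    }
    return null;  // no candidates available
  }
}
{code}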

 recaching improvements
 --

 Key: HDFS-5366
 URL: https://issues.apache.org/jira/browse/HDFS-5366
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe

 There are a few things about our HDFS-4949 recaching strategy that could be 
 improved.
 * We should monitor the DN's maximum and current mlock'ed memory consumption 
 levels, so that we don't ask the DN to do stuff it can't.
 * We should not try to initiate caching on stale DataNodes (although we 
 should not recache things stored on such nodes until they're declared dead).
 * We might want to resend the {{DNA_CACHE}} or {{DNA_UNCACHE}} command a few 
 times before giving up.  Currently, we only send it once.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5375) hdfs.cmd does not expose several snapshot commands.

2013-10-16 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797317#comment-13797317
 ] 

Jing Zhao commented on HDFS-5375:
-

The patch looks pretty good to me. +1

 hdfs.cmd does not expose several snapshot commands.
 ---

 Key: HDFS-5375
 URL: https://issues.apache.org/jira/browse/HDFS-5375
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 2.2.0
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor
 Attachments: HDFS-5375.1.patch


 We need to update hdfs.cmd to expose the snapshotDiff and lsSnapshottableDir 
 commands on Windows.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-5096:
---

Attachment: HDFS-5096-caching.014.patch

 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.002.patch, 
 HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch, 
 HDFS-5096-caching.009.patch, HDFS-5096-caching.010.patch, 
 HDFS-5096-caching.011.patch, HDFS-5096-caching.012.patch, 
 HDFS-5096-caching.014.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5366) recaching improvements

2013-10-16 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797325#comment-13797325
 ] 

Colin Patrick McCabe commented on HDFS-5366:


As Andrew pointed out on HDFS-5096, we should also kick the CRMon on a DN 
failure.  We should also avoid scheduling new work on decommissioning nodes (as 
well as stale nodes).

 recaching improvements
 --

 Key: HDFS-5366
 URL: https://issues.apache.org/jira/browse/HDFS-5366
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Colin Patrick McCabe
Assignee: Colin Patrick McCabe

 There are a few things about our HDFS-4949 recaching strategy that could be 
 improved.
 * We should monitor the DN's maximum and current mlock'ed memory consumption 
 levels, so that we don't ask the DN to do stuff it can't.
 * We should not try to initiate caching on stale DataNodes (although we 
 should not recache things stored on such nodes until they're declared dead).
 * We might want to resend the {{DNA_CACHE}} or {{DNA_UNCACHE}} command a few 
 times before giving up.  Currently, we only send it once.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797324#comment-13797324
 ] 

Colin Patrick McCabe commented on HDFS-5096:


bq. Add a little more to the CachedBlock#triplets javadoc that specifies 
element, prev, next. You could even just copy the javadoc from BlockInfo.

ok

bq. Class javadoc explaining the use of the GSet and IntrusiveCollection

ok

bq. A short is 16 bits, the comment on replicationAndMark indicates it's 8 bits.

fixed

bq. I think this should be a >=:

agree

for the follow-on work, I added a comment about kicking on a DN failure and 
avoiding decommissioned DNs to HDFS-5366

incremental rescan is down the road, I think.  we should do the pool management 
stuff before that...

thanks.  will commit shortly.

 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Attachments: HDFS-5096-caching.002.patch, 
 HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch, 
 HDFS-5096-caching.009.patch, HDFS-5096-caching.010.patch, 
 HDFS-5096-caching.011.patch, HDFS-5096-caching.012.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5371) Let client retry the same NN when dfs.client.test.drop.namenode.response.number is enabled

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797350#comment-13797350
 ] 

Hadoop QA commented on HDFS-5371:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608767/HDFS-5371.000.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5216//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5216//console

This message is automatically generated.

 Let client retry the same NN when 
 dfs.client.test.drop.namenode.response.number is enabled
 

 Key: HDFS-5371
 URL: https://issues.apache.org/jira/browse/HDFS-5371
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.0.0
Reporter: Jing Zhao
Assignee: Jing Zhao
Priority: Minor
 Attachments: HDFS-5371.000.patch


 Currently when dfs.client.test.drop.namenode.response.number is enabled for 
 testing, the client will start failover and try the other NN. But in most of 
 the testing cases we do not need to trigger the client failover here since if 
 the drop-response number is > 1, the next response received from the other NN 
 will also be dropped. We can let the client just simply retry the same NN.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5336) DataNode should not output 'StartupProgress' metrics

2013-10-16 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797354#comment-13797354
 ] 

Chris Nauroth commented on HDFS-5336:
-

Thanks for the patch, Akira.  I built and verified that startup progress 
metrics only showed up in namenode and not datanode.

bq. Change the context of the startup metrics from 'default' to 'dfs'.

Unfortunately, I think this would be backwards-incompatible.  For example, if 
someone was using metrics filtering, then their filtering configuration under 
the old context would stop working.  Can we please remove this part of the 
change?
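
As an illustration of the kind of setup that would break, a 
hadoop-metrics2.properties sketch (the sink name is made up; the class and 
options are standard metrics2 configuration):
{code}
# Sketch: a FileSink restricted to records from the "default" context.
# If the StartupProgress source were moved to the "dfs" context, this
# filter would silently stop matching its records.
namenode.sink.startup_file.class=org.apache.hadoop.metrics2.sink.FileSink
namenode.sink.startup_file.context=default
namenode.sink.startup_file.filename=namenode-startup-metrics.out
{code}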


 DataNode should not output 'StartupProgress' metrics
 

 Key: HDFS-5336
 URL: https://issues.apache.org/jira/browse/HDFS-5336
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.1.0-beta
 Environment: trunk
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Minor
  Labels: metrics
 Attachments: HDFS-5336.patch


 I found the following metrics output from DataNode.
 {code}
 1381355455731 default.StartupProgress: Hostname=trunk, ElapsedTime=0, 
 PercentComplete=0.0, LoadingFsImageCount=0, LoadingFsImageElapsedTime=0, 
 LoadingFsImageTotal=0, LoadingFsImagePercentComplete=0.0, 
 LoadingEditsCount=0, LoadingEditsElapsedTime=0, LoadingEditsTotal=0, 
 LoadingEditsPercentComplete=0.0, SavingCheckpointCount=0, 
 SavingCheckpointElapsedTime=0, SavingCheckpointTotal=0, 
 SavingCheckpointPercentComplete=0.0, SafeModeCount=0, SafeModeElapsedTime=0, 
 SafeModeTotal=0, SafeModePercentComplete=0.0
 {code}
 DataNode should not output 'StartupProgress' metrics because the metrics 
 shows the progress of NameNode startup.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (HDFS-5096) Automatically cache new data added to a cached path

2013-10-16 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe resolved HDFS-5096.


  Resolution: Fixed
   Fix Version/s: HDFS-4949
Target Version/s: HDFS-4949

thanks for the reviews, Andrew and Chris.

 Automatically cache new data added to a cached path
 ---

 Key: HDFS-5096
 URL: https://issues.apache.org/jira/browse/HDFS-5096
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode, namenode
Reporter: Andrew Wang
Assignee: Colin Patrick McCabe
 Fix For: HDFS-4949

 Attachments: HDFS-5096-caching.002.patch, 
 HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch, 
 HDFS-5096-caching.009.patch, HDFS-5096-caching.010.patch, 
 HDFS-5096-caching.011.patch, HDFS-5096-caching.012.patch, 
 HDFS-5096-caching.014.patch


 For some applications, it's convenient to specify a path to cache, and have 
 HDFS automatically cache new data added to the path without sending a new 
 caching request or a manual refresh command.
 One example is new data appended to a cached file. It would be nice to 
 re-cache a block at the new appended length, and cache new blocks added to 
 the file.
 Another example is a cached Hive partition directory, where a user can drop 
 new files directly into the partition. It would be nice if these new files 
 were cached.
 In both cases, this automatic caching would happen after the file is closed, 
 i.e. block replica is finalized.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HDFS-5376) Incremental rescanning of cached blocks and cache entries

2013-10-16 Thread Andrew Wang (JIRA)
Andrew Wang created HDFS-5376:
-

 Summary: Incremental rescanning of cached blocks and cache entries
 Key: HDFS-5376
 URL: https://issues.apache.org/jira/browse/HDFS-5376
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: HDFS-4949
Reporter: Andrew Wang
Assignee: Andrew Wang


{{CacheReplicationMonitor#rescan}} is invoked whenever a new cache entry is 
added or removed. This involves a complete rescan of all cache entries and 
cached blocks, which is potentially expensive. It'd be better to do an 
incremental scan instead. This would also let us incrementally re-scan on 
namespace changes like rename and create for better caching latency.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HDFS-5360) Improvement of usage message of renameSnapshot and deleteSnapshot

2013-10-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13797339#comment-13797339
 ] 

Hadoop QA commented on HDFS-5360:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12608765/HDFS-5360.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/5214//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5214//console

This message is automatically generated.

 Improvement of usage message of renameSnapshot and deleteSnapshot
 -

 Key: HDFS-5360
 URL: https://issues.apache.org/jira/browse/HDFS-5360
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: snapshots
Affects Versions: 3.0.0
Reporter: Shinichi Yamashita
Assignee: Shinichi Yamashita
Priority: Minor
 Attachments: HDFS-5360.patch, HDFS-5360.patch


 When the argument of the hdfs dfs -createSnapshot command is inappropriate, it 
 is displayed as follows.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -createSnapshot
 -createSnapshot: snapshotDir is missing.
 Usage: hadoop fs [generic options] -createSnapshot <snapshotDir> 
 [<snapshotName>]
 {code}
 On the other hand, the -renameSnapshot and -deleteSnapshot commands are 
 displayed as follows, which is not kind to the user.
 {code}
 [hadoop@trunk ~]$ hdfs dfs -renameSnapshot
 renameSnapshot: args number not 3: 0
 [hadoop@trunk ~]$ hdfs dfs -deleteSnapshot
 deleteSnapshot: args number not 2: 0
 {code}
 This changes -renameSnapshot and -deleteSnapshot to output a message 
 similar to that of -createSnapshot.
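
A plausible shape for the improved output, modeled on the -createSnapshot 
message above (the exact argument names in the usage strings are assumptions):
{code}
[hadoop@trunk ~]$ hdfs dfs -renameSnapshot
-renameSnapshot: <snapshotDir> is missing.
Usage: hadoop fs [generic options] -renameSnapshot <snapshotDir> <oldName> <newName>
[hadoop@trunk ~]$ hdfs dfs -deleteSnapshot
-deleteSnapshot: <snapshotDir> is missing.
Usage: hadoop fs [generic options] -deleteSnapshot <snapshotDir> <snapshotName>
{code}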



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5336) DataNode should not output 'StartupProgress' metrics

2013-10-16 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA updated HDFS-5336:


Attachment: HDFS-5336.2.patch

 DataNode should not output 'StartupProgress' metrics
 

 Key: HDFS-5336
 URL: https://issues.apache.org/jira/browse/HDFS-5336
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.1.0-beta
 Environment: trunk
Reporter: Akira AJISAKA
Assignee: Akira AJISAKA
Priority: Minor
  Labels: metrics
 Attachments: HDFS-5336.2.patch, HDFS-5336.patch


 I found the following metrics output from DataNode.
 {code}
 1381355455731 default.StartupProgress: Hostname=trunk, ElapsedTime=0, 
 PercentComplete=0.0, LoadingFsImageCount=0, LoadingFsImageElapsedTime=0, 
 LoadingFsImageTotal=0, LoadingFsImagePercentComplete=0.0, 
 LoadingEditsCount=0, LoadingEditsElapsedTime=0, LoadingEditsTotal=0, 
 LoadingEditsPercentComplete=0.0, SavingCheckpointCount=0, 
 SavingCheckpointElapsedTime=0, SavingCheckpointTotal=0, 
 SavingCheckpointPercentComplete=0.0, SafeModeCount=0, SafeModeElapsedTime=0, 
 SafeModeTotal=0, SafeModePercentComplete=0.0
 {code}
 DataNode should not output 'StartupProgress' metrics because the metrics 
 shows the progress of NameNode startup.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

