date:20140125

[jira] [Commented] (HDFS-5832) Deadlock found in NN between SafeMode#canLeave and DatanodeManager#handleHeartbeat

2014-01-25 Thread Vinay (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882189#comment-13882189
 ] 

Vinay commented on HDFS-5832:
-

As mentioned in HDFS-5132, 
Moving SafemodeMonitor#run() checks under fsn write lock, will solve the issue. 

1. handleHeartbeat() is always done under fsn readlock
2. incrementSafeBlockCount() and getNumLivedatanodes() will always will be 
called under writeLock().

By directly seeing the synchronization order it appears to be deadlock. But its 
avoided by the fsn lock.
 I think jcarder will not identify the read-write lock mechanism.

For this reason only I have made HDFS-5368 duplicate of HDFS-5132

> Deadlock found in NN between SafeMode#canLeave and 
> DatanodeManager#handleHeartbeat
> --
>
> Key: HDFS-5832
> URL: https://issues.apache.org/jira/browse/HDFS-5832
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Rakesh R
>Assignee: Rakesh R
>Priority: Blocker
> Attachments: HDFS-5832.patch, jcarder_nn_deadlock.gif
>
>
> Found the deadlock during the Namenode startup. Attached jcarder report which 
> shows the cycles about the deadlock situation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HDFS-5832) Deadlock found in NN between SafeMode#canLeave and DatanodeManager#handleHeartbeat

2014-01-25 Thread Vinay (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HDFS-5832:


Summary: Deadlock found in NN between SafeMode#canLeave and 
DatanodeManager#handleHeartbeat  (was: Deadeadlock found in NN between 
SafeMode#canLeave and DatanodeManager#handleHeartbeat)

> Deadlock found in NN between SafeMode#canLeave and 
> DatanodeManager#handleHeartbeat
> --
>
> Key: HDFS-5832
> URL: https://issues.apache.org/jira/browse/HDFS-5832
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.0.0
>Reporter: Rakesh R
>Assignee: Rakesh R
>Priority: Blocker
> Attachments: HDFS-5832.patch, jcarder_nn_deadlock.gif
>
>
> Found the deadlock during the Namenode startup. Attached jcarder report which 
> shows the cycles about the deadlock situation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Resolved] (HDFS-5827) Children are not inheriting parent's default ACLs

2014-01-25 Thread Vinay (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-5827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay resolved HDFS-5827.
-

Resolution: Duplicate

This will be implemented in HDFS-5616
And tests are already added as part of HDFS-5702.
So closing this as duplicate.

Thanks Chris for the update

> Children are not inheriting parent's default ACLs
> -
>
> Key: HDFS-5827
> URL: https://issues.apache.org/jira/browse/HDFS-5827
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Vinay
>Assignee: Chris Nauroth
>
> Children are not inheriting the parent's default ACLs on creation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HDFS-5709) Improve upgrade with existing files and directories named ".snapshot"

2014-01-25 Thread Suresh Srinivas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882103#comment-13882103
 ] 

Suresh Srinivas commented on HDFS-5709:
---

bq. This feels procedurally pretty similar to the alternative of a special 
startup option, and it's only required to be set if we hit a .snapshot path. 
Most users will never need to worry about this config.
The problem is, now we have a configuration that needs to be added to NameNode 
for one time upgrade. The configuration lingers on for ever, unless and 
informed user can get rid of it. That is the reason why configuration is 
perhaps not the best way to do this.

The mechanism I and [~jingzhao] talked about addresses .reserved as well. Do 
you plan to address that in this jira?

bq. ...that implies scanning the fsimage and logs initially to enumerate all 
the .snapshot paths, have the user configure their rename rules...
Or, just rename it based on convention as I had proposed. User can either glean 
this from the log (preferred) or from searching fsimage the renamed paths and 
rename them as they wish. Given that the likelihood of users running into this 
conflict in the first place is low, this should be acceptable.

> Improve upgrade with existing files and directories named ".snapshot"
> -
>
> Key: HDFS-5709
> URL: https://issues.apache.org/jira/browse/HDFS-5709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>  Labels: snapshots, upgrade
> Attachments: hdfs-5709-1.patch, hdfs-5709-2.patch, hdfs-5709-3.patch, 
> hdfs-5709-4.patch, hdfs-5709-5.patch
>
>
> Right now in trunk, upgrade fails messily if the old fsimage or edits refer 
> to a directory named ".snapshot". We should at least print a better error 
> message (which I believe was the original intention in HDFS-4666), and [~atm] 
> proposed automatically renaming these files and directories.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HDFS-5138) Support HDFS upgrade in HA

2014-01-25 Thread Suresh Srinivas (JIRA)

[
https://issues.apache.org/jira/browse/HDFS-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882096#comment-13882096
]

Suresh Srinivas commented on HDFS-5138:
---

@Todd, I have had some conversation about this [~atm] related to this jira. I
had brought up one issue about potentially losing editlogs on JournalNode. I
thought that would be addressed before this jira can be committed. I have been
very busy and have not been able to provide all my comments. Reviewing this
patch has been quite tricky. Here are my almost complete review comments. While
some of the issues are minor nits, I do not think this patch and the
documentation is ready.

I am adding information about the design, the way I understand it. Let me know
if I got it wrong.
*Upgrade preparation:*
# New bits are installed on the cluster nodes.
# The cluster is brought down.

*Upgrade:* For HA setup, choose one of the namenodes to initiate upgrade on and
start it with -upgrade flag.
# NN performs preupgrade for all non shared storage directories by moving
current to previous.tmp and creating new current.
#* Failure here is fine. NN start up fails. Next attempt at upgrade the storage
directories are recovered.
# NN performs preupgrade of shared edits (NFS/JournalNodes) over RPC.
JournalNodes current moved to previous.tmp and new current is created.
#* If one of the JN preupgrade fails and upgrade is reattempted, editlog
directory could be lost on the JN. Restarting the JN does not fix the issue.
# NN performs upgrade of non shared edits by writing new CTIME to current and
moving previous.tmp to previous.
#* If one of the JN preupgrade fails and upgrade is reattempted, editlog
directory could be lost on the JN. Restarting the JN does not fix the issue.
# NN performs upgrade of shared edits (NFS/JournalNodes) over RPC. JournalNodes
current has new CTIM and previous.tmp is moved to previous.
# We need to document that all the JournalNodes must be up. If a JN is
irrecoverably lost, configuration must be changed to exclude the JN.

*Rollback:* NN is started with rollback flag
# For all the non shared directories, the NN checks for canRollBack,
essentially ensures that previous directory with the right layout version
exists.
# For all the shared directories, the NN checks for canRollBack, essentially
ensures that previous directory with the right layout version exists.
# NN performs rollback for shared directories (moving previous to current)
#* If rollback of one of the JN fails, then directories are in inconsistent
state. I think any attempt at retrying rollback will fail and will require
manually moving files around. I do not think restarting JN fixes this.
# We need to document that all the JournalNodes must be up. If a JN is
irrecoverably lost, configuration must be changed to exclude the JN.

*Finalize:* DFSAdmin command is run to finalize the upgrade.
# Active NN performs finalizing of editlog. If JN's fail to finalize, active NN
fails to finalize. However it is possible that standby finalizes, leaving the
cluster in an inconsistent state.
# We need to document that all the JournalNodes must be up. If a JN is
irrecoverably lost, configuration must be changed to exclude the JN.

Comments on the code in the patch (this is almost complete):
Comments:
# Minor nit: there are some white space changes
# assertAllResultsEqual - for loop can just start with i = 1? Also if the
collection objects is of size zero or one, the method can return early. Is
there a need to do object.toArray() for these early checks? With that, perhaps
the findbugs exclude may not be necessary.
# Unit test can be added for methods isAtLeastOneActive,
getRpcAddressesForNameserviceId and getProxiesForAllNameNodesInNameservice (I
am okay if this is done in a separate jira)
# Finalizing upgrade is quite tricky. Consider the following scenarios:
#* One NN is active and the other is standby - works fine
#* One NN is active and the other is down or all NNs - finalize command throws
exception and the user will not know if it has succeeded or failed and what to
do next
#* No active NN - throws an exception cannot finalize with no active
#* BlockPoolSliceStorage.java change seems unnecessary
# Why is {{throw new AssertionError("Unreachable code.");}} in
QuorumJournalManager.java methods?
# FSImage#doRollBack() - when canRollBack is false after checking if non-share
directories can rollback, an exception must be immediately thrown, instead of
checking shared editlog. Also printing Log.info when storages can be rolled
back will help in debugging.
# FSEditlog#canRollBackSharedLog should accept StorageInfo instead of Storage
# QuorumJournalManager#canRollBack and getJournalCTime can throw AssertionError
(from DFSUtil.assertAllResultsEqual()). Is that the right exception to expose
or IOException?
# Namenode startup throws AssertionError with -rollback option. I think w

[jira] [Commented] (HDFS-5776) Support 'hedged' reads in DFSClient

2014-01-25 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882037#comment-13882037
 ] 

stack commented on HDFS-5776:
-

bq. I might be misunderstanding, but it seems like this should be a client 
setting, not a datanode setting. Right?

[~cmccabe] You are correct.  I had it wrong.  s/restart DN/restart 
client/regionserver/ in the above.  Thanks C.

> Support 'hedged' reads in DFSClient
> ---
>
> Key: HDFS-5776
> URL: https://issues.apache.org/jira/browse/HDFS-5776
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.0.0
>Reporter: Liang Xie
>Assignee: Liang Xie
> Attachments: HDFS-5776-v2.txt, HDFS-5776-v3.txt, HDFS-5776-v4.txt, 
> HDFS-5776-v5.txt, HDFS-5776-v6.txt, HDFS-5776-v7.txt, HDFS-5776-v8.txt, 
> HDFS-5776.txt
>
>
> This is a placeholder of hdfs related stuff backport from 
> https://issues.apache.org/jira/browse/HBASE-7509
> The quorum read ability should be helpful especially to optimize read outliers
> we can utilize "dfs.dfsclient.quorum.read.threshold.millis" & 
> "dfs.dfsclient.quorum.read.threadpool.size" to enable/disable the hedged read 
> ability from client side(e.g. HBase), and by using DFSQuorumReadMetrics, we 
> could export the interested metric valus into client system(e.g. HBase's 
> regionserver metric).
> The core logic is in pread code path, we decide to goto the original 
> fetchBlockByteRange or the new introduced fetchBlockByteRangeSpeculative per 
> the above config items.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HDFS-5138) Support HDFS upgrade in HA

2014-01-25 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882035#comment-13882035
 ] 

Hadoop QA commented on HDFS-5138:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12625212/hdfs-5138-branch-2.txt
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/5944//console

This message is automatically generated.

> Support HDFS upgrade in HA
> --
>
> Key: HDFS-5138
> URL: https://issues.apache.org/jira/browse/HDFS-5138
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.1.1-beta
>Reporter: Kihwal Lee
>Assignee: Aaron T. Myers
>Priority: Blocker
> Attachments: HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, 
> HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, 
> HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, 
> hdfs-5138-branch-2.txt
>
>
> With HA enabled, NN wo't start with "-upgrade". Since there has been a layout 
> version change between 2.0.x and 2.1.x, starting NN in upgrade mode was 
> necessary when deploying 2.1.x to an existing 2.0.x cluster. But the only way 
> to get around this was to disable HA and upgrade. 
> The NN and the cluster cannot be flipped back to HA until the upgrade is 
> finalized. If HA is disabled only on NN for layout upgrade and HA is turned 
> back on without involving DNs, things will work, but finaliizeUpgrade won't 
> work (the NN is in HA and it cannot be in upgrade mode) and DN's upgrade 
> snapshots won't get removed.
> We will need a different ways of doing layout upgrade and upgrade snapshot.  
> I am marking this as a 2.1.1-beta blocker based on feedback from others.  If 
> there is a reasonable workaround that does not increase maintenance window 
> greatly, we can lower its priority from blocker to critical.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HDFS-5138) Support HDFS upgrade in HA

2014-01-25 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-5138:
--

Attachment: hdfs-5138-branch-2.txt

I cherry-picked HDFS-5721 and HDFS-5719 to branch-2, since they had some 
trivial changes which changed a lot of indentation (thus creating conflicts). 
After doing so, the backport of this JIRA was pretty clean (just imports and 
some changed context in DFSUtil). I ran all the modified test suites on 
branch-2 with this backport patch and they passed. ATM, can you take a quick 
look at this?

> Support HDFS upgrade in HA
> --
>
> Key: HDFS-5138
> URL: https://issues.apache.org/jira/browse/HDFS-5138
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.1.1-beta
>Reporter: Kihwal Lee
>Assignee: Aaron T. Myers
>Priority: Blocker
> Attachments: HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, 
> HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, 
> HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, HDFS-5138.patch, 
> hdfs-5138-branch-2.txt
>
>
> With HA enabled, NN wo't start with "-upgrade". Since there has been a layout 
> version change between 2.0.x and 2.1.x, starting NN in upgrade mode was 
> necessary when deploying 2.1.x to an existing 2.0.x cluster. But the only way 
> to get around this was to disable HA and upgrade. 
> The NN and the cluster cannot be flipped back to HA until the upgrade is 
> finalized. If HA is disabled only on NN for layout upgrade and HA is turned 
> back on without involving DNs, things will work, but finaliizeUpgrade won't 
> work (the NN is in HA and it cannot be in upgrade mode) and DN's upgrade 
> snapshots won't get removed.
> We will need a different ways of doing layout upgrade and upgrade snapshot.  
> I am marking this as a 2.1.1-beta blocker based on feedback from others.  If 
> there is a reasonable workaround that does not increase maintenance window 
> greatly, we can lower its priority from blocker to critical.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HDFS-5721) sharedEditsImage in Namenode#initializeSharedEdits() should be closed before method returns

2014-01-25 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-5721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-5721:
--

Fix Version/s: 2.4.0

Cherry-picked to branch-2 to make HDFS-5138 apply more cleanly.

> sharedEditsImage in Namenode#initializeSharedEdits() should be closed before 
> method returns
> ---
>
> Key: HDFS-5721
> URL: https://issues.apache.org/jira/browse/HDFS-5721
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 3.0.0, 2.4.0
>
> Attachments: hdfs-5721-v1.txt, hdfs-5721-v2.txt, hdfs-5721-v3.txt
>
>
> At line 901:
> {code}
>   FSImage sharedEditsImage = new FSImage(conf,
>   Lists.newArrayList(),
>   sharedEditsDirs);
> {code}
> sharedEditsImage is not closed before the method returns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Updated] (HDFS-5719) FSImage#doRollback() should close prevState before return

2014-01-25 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-5719:
--

Fix Version/s: 2.4.0

Cherry-picked this to branch-2 to make HDFS-5138 apply cleaner.

> FSImage#doRollback() should close prevState before return
> -
>
> Key: HDFS-5719
> URL: https://issues.apache.org/jira/browse/HDFS-5719
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 3.0.0, 2.4.0
>
> Attachments: hdfs-5719.txt
>
>
> {code}
> FSImage prevState = new FSImage(conf);
> {code}
> prevState should be closed before return from doRollback()



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

48 matches

Mail list logo