[jira] [Commented] (HADOOP-8148) Zero-copy ByteBuffer-based compressor / decompressor API
[ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257987#comment-13257987 ] Todd Lipcon commented on HADOOP-8148: - Duplicating my comment from HADOOP-8258: {quote} In current versions of Hadoop, the read path for applications like HBase often looks like: - allocate a byte array for an HFile block (~64kb) - call read() into that byte array: -- copy 1: read() packets from the socket into a direct buffer provided by the DirectBufferPool -- copy 2: copy from the direct buffer pool into the provided byte[] - call setInput on a decompressor -- copy 3: copy from the byte[] back to a direct buffer inside the codec implementation - call decompress: -- JNI code accesses the input buffer and writes to the output buffer -- copy 4: from the output buffer back into the byte[] for the uncompressed hfile block -- inefficiency: HBase now does its own checksumming. Since it has to checksum the byte[], it can't easily use the SSE-enabled checksum path. Given the new direct-buffer read support introduced by HDFS-2834, we can remove copy #2 and #3 - allocate a DirectBuffer for the compressed hfile block, and one for the uncompressed block (we know the size from the hfile block header) - call read() into the direct buffer using the HDFS-2834 API -- copy 1: read() packets from the socket into that buffer - call setInput() with that buffer. no copies necessary - call decompress: -- JNI code accesses the input buffer and writes directly to the output buffer, with no copies - HBase now has the uncompressed block as a direct buffer. It can use the SSE-enabled checksum for better efficiency. This should improve the performance of HBase significantly. We may also be able to use the new API from within SequenceFile and other compressible file formats to avoid two copies from the read path. Similar applies to the write path, but in my experience the write path is less often CPU-constrained, so I'd prefer to concentrate on the read path first. {quote} Zero-copy ByteBuffer-based compressor / decompressor API Key: HADOOP-8148 URL: https://issues.apache.org/jira/browse/HADOOP-8148 Project: Hadoop Common Issue Type: New Feature Components: io Reporter: Tim Broberg Assignee: Tim Broberg Attachments: hadoop8148.patch Per Todd Lipcon's comment in HDFS-2834, "Whenever a native decompression codec is being used, ... we generally have the following copies: 1) Socket -> DirectByteBuffer (in SocketChannel implementation) 2) DirectByteBuffer -> byte[] (in SocketInputStream) 3) byte[] -> Native buffer (set up for decompression) 4*) decompression to a different native buffer (not really a copy - decompression necessarily rewrites) 5) native buffer -> byte[]" With the proposed improvement we can hopefully eliminate #2 and #3 for all applications, and #2, #3, and #5 for libhdfs. The interfaces in the attached patch attempt to address: A - Compression and decompression based on ByteBuffers (HDFS-2834) B - Zero-copy compression and decompression (HDFS-3051) C - Provide the caller a way to know the max space required to hold compressed output. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
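To make the shape of the proposal concrete, a minimal sketch of what ByteBuffer-based compressor/decompressor interfaces could look like follows; the names and signatures here are illustrative assumptions, not the API in the attached patch:
{code}
import java.io.IOException;
import java.nio.ByteBuffer;

// Illustrative sketch only -- not the committed Hadoop API.
interface ByteBufferCompressor {
  /** Supply uncompressed input; bytes are consumed from position() to limit(). */
  void setInput(ByteBuffer uncompressed);
  /** Compress into {@code compressed}, advancing both positions; returns bytes produced. */
  int compress(ByteBuffer compressed) throws IOException;
  /** Point C above: an upper bound on the space needed to hold the compressed output. */
  int maxCompressedLength(int uncompressedLength);
}

interface ByteBufferDecompressor {
  /** Supply compressed input, e.g. a direct buffer filled by an HDFS-2834 style read. */
  void setInput(ByteBuffer compressed);
  /** Decompress into {@code uncompressed} with no intermediate byte[] copies. */
  int decompress(ByteBuffer uncompressed) throws IOException;
}
{code}
With interfaces of this shape, a native codec's JNI code can address both buffers directly, which is what eliminates copies #3 and #5 above.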
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13253885#comment-13253885 ] Todd Lipcon commented on HADOOP-8247: - bq. The problem of this jira is that it makes the auto and manual failover exclusive to each other Yes, this is a temporary state along the way. As discussed elsewhere, we need to flip the manual HA commands over to communicate with the ZKFCs when automatic failover is enabled. Since that code isn't done yet, the current behavior is to disable manual failover. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Auto Failover (HDFS-3042) Attachments: hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8279) Auto-HA: Allow manual failover to be invoked from zkfc.
[ https://issues.apache.org/jira/browse/HADOOP-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13253887#comment-13253887 ] Todd Lipcon commented on HADOOP-8279: - Thanks for filing this, Mingjie. I plan to work on it in the coming weeks. Auto-HA: Allow manual failover to be invoked from zkfc. --- Key: HADOOP-8279 URL: https://issues.apache.org/jira/browse/HADOOP-8279 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Mingjie Lai Assignee: Todd Lipcon Fix For: Auto Failover (HDFS-3042) HADOOP-8247 introduces a config flag to prevent potential status inconsistency between zkfc and namenode, by making auto and manual failover mutually exclusive. However, as described in section 2.7.2 of the design doc at HDFS-2185, we should allow manual and auto failover to co-exist, by: - adding some rpc interfaces at zkfc - manual failover shall be triggered by haadmin, and handled by zkfc if auto failover is enabled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
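A rough sketch of the kind of zkfc-side RPC surface those two bullets describe; the interface and method names are placeholders, not the protocol that was eventually committed:
{code}
import java.io.IOException;

// Placeholder names only: a ZKFC-side RPC interface so that haadmin can route
// manually requested failovers through the failover controller when auto-HA is on.
interface ZKFCFailoverRpc {
  /** Ask this ZKFC to give up (or decline) the active state for a while,
      so that the other node's ZKFC can grab the lock. */
  void cedeActive(int millisToCede) throws IOException;

  /** Ask this ZKFC to coordinate a graceful failover to its local NameNode. */
  void gracefulFailover() throws IOException;
}
{code}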
[jira] [Commented] (HADOOP-8271) PowerPc Build error.
[ https://issues.apache.org/jira/browse/HADOOP-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13252603#comment-13252603 ] Todd Lipcon commented on HADOOP-8271: - Patch looks good. Can you please make a patch against trunk, as well? We'll want to check this in to all branches. PowerPc Build error. Key: HADOOP-8271 URL: https://issues.apache.org/jira/browse/HADOOP-8271 Project: Hadoop Common Issue Type: Bug Components: build Affects Versions: 1.0.2, 1.0.3 Environment: Linux RHEL 6.1 PowerPC + IBM JVM 6.0 SR10 Reporter: Kumar Ravi Labels: patch Fix For: 1.0.3 Attachments: HADOOP-8271.patch Original Estimate: 168h Remaining Estimate: 168h When attempting to build branch-1, the following error is seen and ant exits. [exec] configure: error: Unsupported CPU architecture powerpc64 The following command was used to build hadoop-common ant -Dlibhdfs=true -Dcompile.native=true -Dfusedfs=true -Dcompile.c++=true -Dforrest.home=$FORREST_HOME compile-core-native compile-c++ compile-c++-examples task-controller tar record-parser compile-hdfs-classes package -Djava5.home=/opt/ibm/ibm-java2-ppc64-50/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8198) Support multiple network interfaces
[ https://issues.apache.org/jira/browse/HADOOP-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251892#comment-13251892 ] Todd Lipcon commented on HADOOP-8198: - I agree with the above comments that tokens are starting to fall apart. But, I don't think this current proposal has any relation to the token issue -- Eli is only proposing to add multi-NIC support for datanodes, and datanodes don't have service tokens. They only validate block tokens, which have no associated host/IP/etc. If we wanted multi-NIC on the NN RPC, the token issue would be a blocker, but I don't think that's the current proposal. Support multiple network interfaces --- Key: HADOOP-8198 URL: https://issues.apache.org/jira/browse/HADOOP-8198 Project: Hadoop Common Issue Type: New Feature Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: MultipleNifsv1.pdf, MultipleNifsv2.pdf, MultipleNifsv3.pdf Hadoop does not currently utilize multiple network interfaces, which is a common user request, and important in enterprise environments. This jira covers a proposal for enhancements to Hadoop so it better utilizes multiple network interfaces. The primary motivation being improved performance, performance isolation, resource utilization and fault tolerance. The attached design doc covers the high-level use cases, requirements, a proposal for trunk/0.23, discussion on related features, and a proposal for Hadoop 1.x that covers a subset of the functionality of the trunk/0.23 proposal. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8198) Support multiple network interfaces
[ https://issues.apache.org/jira/browse/HADOOP-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251926#comment-13251926 ] Todd Lipcon commented on HADOOP-8198: - bq. Are we going to confine yarn/MR services to using only one NIC? If I recall correctly, the shuffle services use job tokens and not service tokens as well, right? I think it's OK to confine the RPC interfaces to using one NIC (for now) as they're generally not throughput-intensive. Adding multi-NIC support for them would be nice in the future for fault tolerance but I think it should be a separate task, since as you've brought up, it's much harder. Support multiple network interfaces --- Key: HADOOP-8198 URL: https://issues.apache.org/jira/browse/HADOOP-8198 Project: Hadoop Common Issue Type: New Feature Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: MultipleNifsv1.pdf, MultipleNifsv2.pdf, MultipleNifsv3.pdf Hadoop does not currently utilize multiple network interfaces, which is a common user request, and important in enterprise environments. This jira covers a proposal for enhancements to Hadoop so it better utilizes multiple network interfaces. The primary motivation being improved performance, performance isolation, resource utilization and fault tolerance. The attached design doc covers the high-level use cases, requirements, a proposal for trunk/0.23, discussion on related features, and a proposal for Hadoop 1.x that covers a subset of the functionality of the trunk/0.23 proposal. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8269) Fix some javadoc warnings on branch-1
[ https://issues.apache.org/jira/browse/HADOOP-8269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251990#comment-13251990 ] Todd Lipcon commented on HADOOP-8269: - +1 Fix some javadoc warnings on branch-1 - Key: HADOOP-8269 URL: https://issues.apache.org/jira/browse/HADOOP-8269 Project: Hadoop Common Issue Type: Bug Components: documentation Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8269.txt There are some javadoc warnings on branch-1, let's fix them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8262) Between mapper and reducer, Hadoop inserts spaces into my string
[ https://issues.apache.org/jira/browse/HADOOP-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249942#comment-13249942 ] Todd Lipcon commented on HADOOP-8262: - http://hadoop.apache.org/mapreduce/mailing_lists.html has instructions on how to subscribe to the lists Between mapper and reducer, Hadoop inserts spaces into my string Key: HADOOP-8262 URL: https://issues.apache.org/jira/browse/HADOOP-8262 Project: Hadoop Common Issue Type: Bug Components: io Affects Versions: 0.20.0 Environment: Eclipse plugin, Windows Reporter: Adriana Sbircea In the mapper I send a number as the key, and as the value another number which has more than one digit, but I send them as Text objects. In my reducer all the values for a key have spaces between every digit of a value. I can't do my task because of this problem. I don't use combiners or anything else. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8152) Expand public APIs for security library classes
[ https://issues.apache.org/jira/browse/HADOOP-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250163#comment-13250163 ] Todd Lipcon commented on HADOOP-8152: - I generally agree that the static loginUser concept is a mess and should probably be killed in favor of using methods like {{loginFromKeytabAndReturnUGI}} everywhere. But I also agree with Aaron that we can mark these as evolving and it doesn't force our hand down the road. Expand public APIs for security library classes --- Key: HADOOP-8152 URL: https://issues.apache.org/jira/browse/HADOOP-8152 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HADOOP-8152.patch, HADOOP-8152.patch Currently projects like Hive and HBase use UserGroupInformation and SecurityUtil methods. Both of these classes are marked LimitedPrivate(HDFS,MR) but should probably be marked more generally public. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
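For reference, the keytab-login-and-return style mentioned above looks roughly like this sketch (principal and keytab path are example values; error handling trimmed):
{code}
import java.io.IOException;
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginExample {
  // Example principal/keytab values; substitute your own.
  static FileSystem fsAsServiceUser() throws IOException, InterruptedException {
    UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "hbase/host.example.com@EXAMPLE.COM", "/etc/security/keytabs/hbase.keytab");
    // Run the filesystem access as the freshly logged-in user, without
    // relying on the static loginUser state.
    return ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
      @Override
      public FileSystem run() throws Exception {
        return FileSystem.get(new Configuration());
      }
    });
  }
}
{code}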
[jira] [Commented] (HADOOP-8248) Clarify bylaws about review-then-commit policy
[ https://issues.apache.org/jira/browse/HADOOP-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250176#comment-13250176 ] Todd Lipcon commented on HADOOP-8248: - bq. For a joint work of multiple committers, all of the authors cannot review the patch for significant patches. My thinking here is that it's fine if one committer does some minor fixup or adds test cases to a patch that another authored. For example, if I start a patch, but don't get time to finish the unit tests, and you help out by adding a test, I think it's OK for you to commit it assuming I +1 your addition. Put another way, any given chunk of the patch should be reviewed by a committer who didn't write it. I don't want to get too pedantic about it, though -- IMO it's the spirit that's important. Code reviews are important for spotting mistakes, and it's hard to spot your own mistakes. So any piece of code should be +1ed by an expert (i.e. a committer) who didn't write that bit of code. bq. For merging from a branch, the three +1's cannot be cast from any of the committers who worked on the branch. I disagree on this -- my assumption is that all of the patches on the branch have been reviewed according to the above policy, so everything's been looked at by someone who didn't write it. In my mind, the +1s on the merge are basically a commitment to stand by the work to be merged and an assertion that you think it is good code, a good feature, etc. If the development on the branch looks shoddy/sketchy/whatever, then there's plenty of opportunity for other committers to -1 it. Perhaps we should add a 3-day minimum voting period for branch merges to trunk when that branch didn't follow the normal RTC guidelines? Clarify bylaws about review-then-commit policy -- Key: HADOOP-8248 URL: https://issues.apache.org/jira/browse/HADOOP-8248 Project: Hadoop Common Issue Type: Task Reporter: Todd Lipcon Attachments: c8248_20120409.patch, proposed-bylaw-change.txt As discussed on the mailing list (thread Requirements for patch review 4/4/2012) we should clarify the bylaws with respect to the review-then-commit policy. This JIRA is to agree on the proposed change. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8248) Clarify bylaws about review-then-commit policy
[ https://issues.apache.org/jira/browse/HADOOP-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250179#comment-13250179 ] Todd Lipcon commented on HADOOP-8248: - To add a little more: I think the requirement of 3 committer +1s from people who didn't work on the branch will make it really hard to ever merge branches. Looking, for example, at the recent HA branch merge, it listed the following people as patch contributors: bq. Contributed by Todd Lipcon, Aaron T. Myers, Eli Collins, Uma Maheswara Rao G, Bikas Saha, Suresh Srinivas, Jitendra Nath Pandey, Hari Mankude, Brandon Li, Sanjay Radia, Mingjie Lai, and Gregory Chanan Finding 3 active committers who are not on that list and are knowledgeable about NN internals would have been very difficult. In fact of the committers who did +1 the merge, you're the only one who isn't in the above list :) Clarify bylaws about review-then-commit policy -- Key: HADOOP-8248 URL: https://issues.apache.org/jira/browse/HADOOP-8248 Project: Hadoop Common Issue Type: Task Reporter: Todd Lipcon Attachments: c8248_20120409.patch, proposed-bylaw-change.txt As discussed on the mailing list (thread Requirements for patch review 4/4/2012) we should clarify the bylaws with respect to the review-then-commit policy. This JIRA is to agree on the proposed change. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250271#comment-13250271 ] Todd Lipcon commented on HADOOP-8247: - Hi Hari. That's specifically the point of the FORCEMANUAL flag. It is not safe to use it with automatic failover. So, the user has to accept the warning and acknowledge they're about to do something dumb, that _will_ break auto failover if the ZKFCs are running. The purpose of allowing it at all is to give a recourse for an expert admin if their ZK cluster has crashed and they need to manually do a failover in an emergency situation. Its use is highly discouraged. The warning printed is:
{code}
"--forceManual allows the manual failover commands to be used\n" +
" even when automatic failover is enabled. This\n" +
" flag is DANGEROUS and should only be used with\n" +
" expert guidance.");
{code}
Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250272#comment-13250272 ] Todd Lipcon commented on HADOOP-8247: - P.S. if you'd like I'd be happy to rename it to something even scarier sounding... like --dangerous-manual-override, or whatever you prefer. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8248) Clarify bylaws about review-then-commit policy
[ https://issues.apache.org/jira/browse/HADOOP-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250291#comment-13250291 ] Todd Lipcon commented on HADOOP-8248: - bq. Which above policy? Branches can use RTC, or whatever they decide upon. Therefore it is possible that the branch content has not actually been reviewed by another committer before merging. Right, that's why I also added: Perhaps we should add a 3-day minimum voting period for branch merges to trunk when that branch didn't follow the normal RTC guidelines? Clarify bylaws about review-then-commit policy -- Key: HADOOP-8248 URL: https://issues.apache.org/jira/browse/HADOOP-8248 Project: Hadoop Common Issue Type: Task Reporter: Todd Lipcon Attachments: c8248_20120409.patch, proposed-bylaw-change.txt As discussed on the mailing list (thread Requirements for patch review 4/4/2012) we should clarify the bylaws with respect to the review-then-commit policy. This JIRA is to agree on the proposed change. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250294#comment-13250294 ] Todd Lipcon commented on HADOOP-8247: - I also ran the manual tests again. Here's the usage output of HAAdmin:
{code}
Usage: DFSHAAdmin [-ns nameserviceId]
    [-transitionToActive [--forcemanual] serviceId]
    [-transitionToStandby [--forcemanual] serviceId]
    [-failover [--forcefence] [--forceactive] [--forcemanual] serviceId serviceId]
    [-getServiceState serviceId]
    [-checkHealth serviceId]
    [-help command]

--forceManual allows the manual failover commands to be used
 even when automatic failover is enabled. This
 flag is DANGEROUS and should only be used with
 expert guidance.
{code}
Here's what happens if I try to use a state change command with auto-HA enabled:
{code}
$ ./bin/hdfs haadmin -transitionToActive nn1
Automatic failover is enabled for NameNode at todd-w510/127.0.0.1:8021
Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state. If you are very sure you know what you are doing, please specify the forcemanual flag.
$ echo $?
255
{code}
Also checked the other two state-changing ops (transitionToStandby and failover) and they yielded the same error message.
- I verified that {{-getServiceState}} and {{-checkHealth}} continue to work.
- I verified that the -forceManual flag worked:
{code}
$ ./bin/hdfs haadmin -transitionToStandby -forcemanual nn1
12/04/09 16:12:38 WARN ha.HAAdmin: Proceeding with manual HA state management even though automatic failover is enabled for NameNode at todd-w510/127.0.0.1:8021
{code}
(also for -transitionToActive and -failover)
- Verified that {{start-dfs.sh}} starts the ZKFCs on both of my configured NNs when auto-HA is enabled. Also verified {{stop-dfs.sh}} stops the ZKFCs. Discovered trivial bug HDFS-3234 here.
Next, I modified my config to set the auto failover flag to false.
- verified that start-dfs.sh doesn't try to start ZKFCs.
- verified that if I try to start a ZKFC, it bails:
{code}
12/04/09 16:19:12 INFO tools.DFSZKFailoverController: Failover controller configured for NameNode nameserviceId1.nn2
12/04/09 16:19:12 FATAL ha.ZKFailoverController: Automatic failover is not enabled for NameNode at todd-w510/127.0.0.1:8022. Please ensure that automatic failover is enabled in the configuration before running the ZK failover controller.
{code}
- verified that the haadmin commands all function without any {{-forcemanual}} flag specified.
Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250307#comment-13250307 ] Todd Lipcon commented on HADOOP-8247: - bq. There are always admins who disregard these warnings I think they deserve what they get... admins can also decide to run rm -Rf /my/metadata/dir and get into a bad state. bq. Instead, wouldn't it be better to come up with a set of procedures to unwedge the cluster, starting with setting auto-failover key to false, resetting NNs and using manual failover Assumedly you want to be able to do this without incurring downtime. Certainly if downtime is acceptable, that would be the right response.. But still I think having a manual override here is useful for advanced operators who need to use it in an extenuating circumstance. As I said above, I'm OK giving it a scarier name and/or making it prompt for confirmation upon use, with a scary warning message. I'm even OK removing it from the documentation, so people aren't lured into using it when they don't really know what they're doing. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8152) Expand public APIs for security library classes
[ https://issues.apache.org/jira/browse/HADOOP-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249662#comment-13249662 ] Todd Lipcon commented on HADOOP-8152: - Looking at HBase, it seems like it's also using the following which aren't marked public by this patch: - SecurityUtil.getServerPrincipal - enum UGI.AuthenticationMethod (marked evolving but not marked public) - UGI.getRealUser - UGI.isLoginKeytabBased - UGI.reloginFromKeytab - UGI.reloginFromTicketCache - UGI.getUserName - UGI.createUserForTesting Expand public APIs for security library classes --- Key: HADOOP-8152 URL: https://issues.apache.org/jira/browse/HADOOP-8152 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HADOOP-8152.patch Currently projects like Hive and HBase use UserGroupInformation and SecurityUtil methods. Both of these classes are marked LimitedPrivate(HDFS,MR) but should probably be marked more generally public. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
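For context, making these classes public here means widening the audience annotations, roughly along these lines (illustrative only, not the exact patch):
{code}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Illustrative of the annotation change being discussed, not the exact patch:
// widen the audience from LimitedPrivate({"HDFS", "MapReduce"}) to Public,
// while keeping the API marked Evolving so it can still change between minor releases.
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class UserGroupInformation {
  // existing methods (getRealUser(), reloginFromKeytab(), ...) unchanged
}
{code}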
[jira] [Commented] (HADOOP-8261) Har file system doesn't deal with FS URIs with a host but no port
[ https://issues.apache.org/jira/browse/HADOOP-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249663#comment-13249663 ] Todd Lipcon commented on HADOOP-8261: - Nit: spurious 'a' here at the end of the sentence {code} + * port specified, as is often the case with an HA setup.a {code} Another nit: I think the test case should be capitalized WithHA instead of WithHa to match our other test cases which all have the keyword HA in them (makes it easy to run mvn test '-Dtest=*HA*') +1 once you fix these Har file system doesn't deal with FS URIs with a host but no port - Key: HADOOP-8261 URL: https://issues.apache.org/jira/browse/HADOOP-8261 Project: Hadoop Common Issue Type: Bug Components: fs Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HADOOP-8261-with-test-in-HDFS.patch, HADOOP-8261.patch If you try to run an MR job with a Hadoop Archive as the input, but the URI you give it has no port specified (e.g. hdfs://simon) the job will fail with an error like the following: {noformat} java.io.IOException: Incomplete HDFS URI, no host: hdfs://simon:-1/user/atm/input.har/input {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248746#comment-13248746 ] Todd Lipcon commented on HADOOP-8247: - I added a struct because I figured we may want to add more fields in the future that fulfill a similar purpose. For example, I can imagine that a failover event might be tagged with a string reason field -- sort of like how the Linux shutdown command can take a message. This would just be logged on the NN side. Another example is the proposed fix for HADOOP-8217, where we need to add an epoch number to the failover requests to get an ordering of failover events. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
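A sketch of the kind of request struct being described; the type and field names are illustrative guesses rather than the committed HA protocol classes:
{code}
// Illustrative only: a request-info struct that can grow new fields without
// changing the RPC method signatures.
class StateChangeRequestInfo {
  enum RequestSource { REQUEST_BY_USER, REQUEST_BY_USER_FORCED, REQUEST_BY_ZKFC }

  private final RequestSource source;

  // Possible future fields mentioned in the comment above:
  // private String reason;  // operator-supplied reason, logged on the NN side
  // private long epoch;     // ordering of failover events (HADOOP-8217)

  StateChangeRequestInfo(RequestSource source) {
    this.source = source;
  }

  RequestSource getSource() {
    return source;
  }
}
{code}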
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248918#comment-13248918 ] Todd Lipcon commented on HADOOP-8247: - Hi Hari. JIRA doesn't support cross-project subtasks. You can use the following filter to track all auto-HA related tasks: https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hiderequestId=12319482 (let me know if the link doesn't work, I think I set it up to be world-shared) Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8257) Auto-HA: TestZKFailoverControllerStress occasionally fails with Mockito error
[ https://issues.apache.org/jira/browse/HADOOP-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248928#comment-13248928 ] Todd Lipcon commented on HADOOP-8257: - Jenkins won't run on this since it's on a branch. I verified by changing the test runtime to 3 seconds and looping it. Without the patch, it failed with the mockito error after 3 or 4 minutes. I then looped with the patch for 15 minutes without a failure. Auto-HA: TestZKFailoverControllerStress occasionally fails with Mockito error - Key: HADOOP-8257 URL: https://issues.apache.org/jira/browse/HADOOP-8257 Project: Hadoop Common Issue Type: Bug Components: auto-failover, test Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Trivial Attachments: hadoop-8257.txt Once in a while I've seen the following in TestZKFailoverControllerStress: Unfinished stubbing detected here: - at org.apache.hadoop.ha.TestZKFailoverControllerStress.testRandomHealthAndDisconnects(TestZKFailoverControllerStress.java:118) E.g. thenReturn() may be missing This is because we set up the mock answers _after_ starting the ZKFCs. So if the ZKFC calls the mock object while it's in the middle of the setup, this exception occurs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
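The underlying rule the fix follows is to finish all Mockito stubbing before starting anything that can call the mock concurrently. A generic illustration of that pattern (not the actual test code), assuming a made-up HealthCheck mock:
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

// Generic illustration of the race described above, not the actual test code.
interface HealthCheck {
  boolean isHealthy();
}

class StubbingOrderExample {
  void safeSetup() {
    final HealthCheck hc = mock(HealthCheck.class);

    // Finish all stubbing *before* anything else can call the mock. If a
    // background thread hits hc.isHealthy() while a when(...) clause is still
    // half-built, Mockito reports "Unfinished stubbing detected".
    when(hc.isHealthy()).thenReturn(true);

    Thread monitor = new Thread(new Runnable() {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          hc.isHealthy(); // simulated ZKFC-style polling of the mock
        }
      }
    });
    monitor.setDaemon(true);
    monitor.start(); // only start once stubbing is complete
  }
}
{code}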
[jira] [Commented] (HADOOP-8258) Add interfaces for compression codecs to use direct byte buffers
[ https://issues.apache.org/jira/browse/HADOOP-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249035#comment-13249035 ] Todd Lipcon commented on HADOOP-8258: - In current versions of Hadoop, the read path for applications like HBase often looks like: - allocate a byte array for an HFile block (~64kb) - call read() into that byte array: -- copy 1: read() packets from the socket into a direct buffer provided by the DirectBufferPool -- copy 2: copy from the direct buffer pool into the provided byte[] - call setInput on a decompressor -- copy 3: copy from the byte[] back to a direct buffer inside the codec implementation - call decompress: -- JNI code accesses the input buffer and writes to the output buffer -- copy 4: from the output buffer back into the byte[] for the uncompressed hfile block -- inefficiency: HBase now does its own checksumming. Since it has to checksum the byte[], it can't easily use the SSE-enabled checksum path. Given the new direct-buffer read support introduced by HDFS-2834, we can remove copy #2 and #3 - allocate a DirectBuffer for the compressed hfile block, and one for the uncompressed block (we know the size from the hfile block header) - call read() into the direct buffer using the HDFS-2834 API -- copy 1: read() packets from the socket into that buffer - call setInput() with that buffer. no copies necessary - call decompress: -- JNI code accesses the input buffer and writes directly to the output buffer, with no copies - HBase now has the uncompressed block as a direct buffer. It can use the SSE-enabled checksum for better efficiency. This should improve the performance of HBase significantly. We may also be able to use the new API from within SequenceFile and other compressible file formats to avoid two copies from the read path. Similar applies to the write path, but in my experience the write path is less often CPU-constrained, so I'd prefer to concentrate on the read path first. Add interfaces for compression codecs to use direct byte buffers Key: HADOOP-8258 URL: https://issues.apache.org/jira/browse/HADOOP-8258 Project: Hadoop Common Issue Type: New Feature Components: io, native, performance Affects Versions: 3.0.0 Reporter: Todd Lipcon Currently, the codec interface only provides input/output functions based on byte arrays. Given that most of the codecs are implemented in native code, this necessitates two extra copies - one to copy the input data to a direct buffer, and one to copy the output data back to a byte array. We should add interfaces to Decompressor/Compressor that can work directly with direct byte buffers to avoid these copies. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
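Seen from the caller's side, the proposed flow would look roughly like the sketch below; the reader and decompressor types are placeholders for the APIs still being defined in the linked JIRAs:
{code}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

// The reader and decompressor types below are placeholders, not real Hadoop
// classes; only the buffer handling is the point here.
class DirectReadSketch {
  interface DirectReader {
    /** HDFS-2834-style read straight into a (direct) ByteBuffer. */
    int read(ByteBuffer dst) throws IOException;
  }

  interface DirectDecompressor {
    void setInput(ByteBuffer compressed);
    void decompress(ByteBuffer uncompressed) throws IOException;
  }

  static ByteBuffer readBlock(DirectReader in, DirectDecompressor codec,
                              int compressedLen, int uncompressedLen) throws IOException {
    // Both sizes are known from the HFile block header.
    ByteBuffer compressed = ByteBuffer.allocateDirect(compressedLen);
    ByteBuffer uncompressed = ByteBuffer.allocateDirect(uncompressedLen);

    while (compressed.hasRemaining()) {
      if (in.read(compressed) < 0) {      // copy 1: socket -> direct buffer
        throw new EOFException("unexpected end of stream");
      }
    }
    compressed.flip();

    codec.setInput(compressed);           // no intermediate byte[] copy
    codec.decompress(uncompressed);       // JNI writes straight into the output buffer
    uncompressed.flip();
    return uncompressed;                  // HBase can run the SSE checksum over this
  }
}
{code}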
[jira] [Commented] (HADOOP-8258) Add interfaces for compression codecs to use direct byte buffers
[ https://issues.apache.org/jira/browse/HADOOP-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249086#comment-13249086 ] Todd Lipcon commented on HADOOP-8258: - Ah, thanks, sorry I missed that. Do you think this JIRA should just be marked as duplicate? I can reproduce the comments into the other one. Add interfaces for compression codecs to use direct byte buffers Key: HADOOP-8258 URL: https://issues.apache.org/jira/browse/HADOOP-8258 Project: Hadoop Common Issue Type: New Feature Components: io, native, performance Affects Versions: 3.0.0 Reporter: Todd Lipcon Currently, the codec interface only provides input/output functions based on byte arrays. Given that most of the codecs are implemented in native code, this necessitates two extra copies - one to copy the input data to a direct buffer, and one to copy the output data back to a byte array. We should add interfaces to Decompressor/Compressor that can work directly with direct byte buffers to avoid these copies. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249097#comment-13249097 ] Todd Lipcon commented on HADOOP-8247: - bq. Can we make this simpler by not supporting manual failover? Yes. That's the current version of the patch - if you enable automatic, then you don't get manual. But, as described in the design doc in HDFS-2185, there are good reasons to support manually initiated failover even when the system is set up for automatic. That will be done separately as a followup. This patch is just meant for safety purposes. Another advantage of this patch is that we can amend the start-dfs.sh script to automatically start ZKFCs when the conf flag is present. My next rev will do this. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8086) KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN
[ https://issues.apache.org/jira/browse/HADOOP-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247410#comment-13247410 ] Todd Lipcon commented on HADOOP-8086: - This patch seems to use slf4j, whereas we use commons-logging elsewhere. Is this something particular to the hadoop-auth component? Or just a mistake? KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN --- Key: HADOOP-8086 URL: https://issues.apache.org/jira/browse/HADOOP-8086 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 0.23.2, 0.24.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Minor Fix For: 0.23.2 Attachments: HADOOP-8086.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8086) KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN
[ https://issues.apache.org/jira/browse/HADOOP-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247488#comment-13247488 ] Todd Lipcon commented on HADOOP-8086: - OK. I will pretend that that makes sense, and give a +1 for this patch then. KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN --- Key: HADOOP-8086 URL: https://issues.apache.org/jira/browse/HADOOP-8086 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 0.23.2, 0.24.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Minor Fix For: 0.23.2 Attachments: HADOOP-8086.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-6941) Support non-SUN JREs in UserGroupInformation
[ https://issues.apache.org/jira/browse/HADOOP-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248011#comment-13248011 ] Todd Lipcon commented on HADOOP-6941: - Looks like this patch broke the original security support. See HADOOP-8251. Support non-SUN JREs in UserGroupInformation Key: HADOOP-6941 URL: https://issues.apache.org/jira/browse/HADOOP-6941 Project: Hadoop Common Issue Type: Bug Environment: SLES 11, Apache Harmony 6 and SLES 11, IBM Java 6 Reporter: Stephen Watt Assignee: Luke Lu Fix For: 1.0.3, 2.0.0 Attachments: 6941-1.patch, 6941-branch1.patch, HADOOP-6941.patch, hadoop-6941.patch Attempting to format the namenode or attempting to start Hadoop using Apache Harmony or the IBM Java JREs results in the following exception: 10/09/07 16:35:05 ERROR namenode.NameNode: java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal at org.apache.hadoop.security.UserGroupInformation.clinit(UserGroupInformation.java:223) at java.lang.J9VMInternals.initializeImpl(Native Method) at java.lang.J9VMInternals.initialize(J9VMInternals.java:200) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setConfigurationParameters(FSNamesystem.java:420) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:391) at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1240) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) Caused by: java.lang.ClassNotFoundException: com.sun.security.auth.UnixPrincipal at java.net.URLClassLoader.findClass(URLClassLoader.java:421) at java.lang.ClassLoader.loadClass(ClassLoader.java:652) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:346) at java.lang.ClassLoader.loadClass(ClassLoader.java:618) ... 8 more This is a negative regression as previous versions of Hadoop worked with these JREs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8251) SecurityUtil.fetchServiceTicket broken after HADOOP-6941
[ https://issues.apache.org/jira/browse/HADOOP-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248012#comment-13248012 ] Todd Lipcon commented on HADOOP-8251: - The bug was simple -- the string used for the name of the Krb5Util class was mistakenly just the package name instead of the class name. It looks like the IBM implementation has the same bug, but googling around, I don't think there even _is_ a Krb5Util class in IBM's library, at least not with the functions we need. So I am skeptical that security support works when running on the IBM JRE. SecurityUtil.fetchServiceTicket broken after HADOOP-6941 Key: HADOOP-8251 URL: https://issues.apache.org/jira/browse/HADOOP-8251 Project: Hadoop Common Issue Type: Bug Components: security Affects Versions: 1.1.0, 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hadoop-8251.txt HADOOP-6941 replaced direct references to some classes with reflective access so as to support other JDKs. Unfortunately there was a mistake in the name of the Krb5Util class, which broke fetchServiceTicket. This manifests itself as the inability to run checkpoints or other krb5-SSL HTTP-based transfers: java.lang.ClassNotFoundException: sun.security.jgss.krb5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
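Reduced to a minimal example, the bug looks like this (class names per the comment; this is not the literal Hadoop diff):
{code}
// Minimal illustration of the reported bug; not the literal Hadoop diff.
class Krb5UtilLookup {
  static Class<?> load() throws ClassNotFoundException {
    // Broken: this string is only a package name, so it always throws
    //   java.lang.ClassNotFoundException: sun.security.jgss.krb5
    // Class.forName("sun.security.jgss.krb5");

    // Fixed: the fully-qualified class name is required.
    return Class.forName("sun.security.jgss.krb5.Krb5Util");
  }
}
{code}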
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248019#comment-13248019 ] Todd Lipcon commented on HADOOP-8247: - I should of course note that this is only the first step. After this is committed, the idea is to make the haadmin -failover command line work in coordination with the ZKFC daemons to do a controlled failover. But in the meantime, it's disallowed so that users can't shoot themselves in the foot by running this command. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-7211) Security uses proprietary Sun APIs
[ https://issues.apache.org/jira/browse/HADOOP-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248032#comment-13248032 ] Todd Lipcon commented on HADOOP-7211: - HADOOP-6941 fixed compilation on the IBM JDK using reflection, but added some code which definitely does not work - eg:
{code}
if (System.getProperty("java.vendor").contains("IBM")) {
  principalClass = Class.forName("com.ibm.security.krb5.PrincipalName");
  credentialsClass = Class.forName("com.ibm.security.krb5.Credentials");
  krb5utilClass = Class.forName("com.ibm.security.jgss.mech.krb5");
{code}
but the krb5utilClass here is invalid, and there doesn't appear to be any equivalent in the IBM JDK. Instead of this code which kind of looks like it should work, we should just throw an UnsupportedOperationException until someone actually fixes this. Security uses proprietary Sun APIs -- Key: HADOOP-7211 URL: https://issues.apache.org/jira/browse/HADOOP-7211 Project: Hadoop Common Issue Type: Improvement Components: security Reporter: Eli Collins Assignee: Luke Lu The security code uses the KrbException, Credentials, and PrincipalName classes from sun.security.krb5 and Krb5Util from sun.security.jgss.krb5. These may disappear in future Java releases. Also Hadoop does not compile using jdks that do not support them, for example with the following IBM JDK. {noformat} $ /home/eli/toolchain/java-x86_64-60/bin/java -version java version 1.6.0 Java(TM) SE Runtime Environment (build pxa6460sr9fp1-20110208_03(SR9 FP1)) IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460sr9-20110203_74623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 GC - 20101027_AA) JCL - 20110203_01 {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
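One possible shape for that suggestion, sketched below; the IBM class name is a stand-in, since the comment notes a real equivalent may not exist:
{code}
// Sketch of the "fail loudly" suggestion above. The IBM class name is a
// stand-in, and the point is to surface the gap instead of deferring the failure.
class KrbReflectionSketch {
  static Class<?> loadKrb5Util() {
    String vendor = System.getProperty("java.vendor");
    try {
      if (vendor != null && vendor.contains("IBM")) {
        return Class.forName("com.ibm.security.jgss.mech.krb5.Krb5Util");
      }
      return Class.forName("sun.security.jgss.krb5.Krb5Util");
    } catch (ClassNotFoundException e) {
      throw new UnsupportedOperationException(
          "Kerberos ticket fetching is not supported on this JRE (" + vendor + ")", e);
    }
  }
}
{code}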
[jira] [Commented] (HADOOP-8251) SecurityUtil.fetchServiceTicket broken after HADOOP-6941
[ https://issues.apache.org/jira/browse/HADOOP-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248054#comment-13248054 ] Todd Lipcon commented on HADOOP-8251: - Hey Devaraj. Sorry, I already committed this, and I don't feel comfortable changing the code if I can't test it (I don't have ready access to an IBM JDK installation). I think rather than just fixing this bug, someone should run through the whole security test plan on the IBM JDK -- perhaps as part of HADOOP-7211? bq. Seems like the methods are there and with the desired signatures.. bq. Did I miss something? I was basing it on these docs: http://www.ibm.com/developerworks/java/jdk/security/60/secguides/jgssDocs/api/index.html?com/ibm/security/jgss/mech/krb5/Krb5RealmUtil.html which don't mention krb5util in that package SecurityUtil.fetchServiceTicket broken after HADOOP-6941 Key: HADOOP-8251 URL: https://issues.apache.org/jira/browse/HADOOP-8251 Project: Hadoop Common Issue Type: Bug Components: security Affects Versions: 1.1.0, 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Fix For: 1.0.3, 1.1.0, 2.0.0 Attachments: hadoop-8251-b1.txt, hadoop-8251.txt HADOOP-6941 replaced direct references to some classes with reflective access so as to support other JDKs. Unfortunately there was a mistake in the name of the Krb5Util class, which broke fetchServiceTicket. This manifests itself as the inability to run checkpoints or other krb5-SSL HTTP-based transfers: java.lang.ClassNotFoundException: sun.security.jgss.krb5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8251) SecurityUtil.fetchServiceTicket broken after HADOOP-6941
[ https://issues.apache.org/jira/browse/HADOOP-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248078#comment-13248078 ] Todd Lipcon commented on HADOOP-8251: - bq. Please have at least one simple test that fails without the patch. I'm just fixing what the previous patch broke. I don't have time to write a test, since this depends on security infrastructure, etc, and I can't get that to work right (see my comment on HDFS-3016). The original patch should have had a test, I agree. But my options were to revert that patch, or just fix it, so I did the latter without a test. SecurityUtil.fetchServiceTicket broken after HADOOP-6941 Key: HADOOP-8251 URL: https://issues.apache.org/jira/browse/HADOOP-8251 Project: Hadoop Common Issue Type: Bug Components: security Affects Versions: 1.1.0, 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Fix For: 1.0.3, 1.1.0, 2.0.0 Attachments: hadoop-8251-b1.txt, hadoop-8251.txt HADOOP-6941 replaced direct references to some classes with reflective access so as to support other JDKs. Unfortunately there was a mistake in the name of the Krb5Util class, which broke fetchServiceTicket. This manifests itself as the inability to run checkpoints or other krb5-SSL HTTP-based transfers: java.lang.ClassNotFoundException: sun.security.jgss.krb5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-7211) Security uses proprietary Sun APIs
[ https://issues.apache.org/jira/browse/HADOOP-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248081#comment-13248081 ] Todd Lipcon commented on HADOOP-7211: - bq. This jira is incorporated by the patches in HADOOP-6941 and HADOOP-7211 Did you mean another JIRA? This _is_ HADOOP-7211 Security uses proprietary Sun APIs -- Key: HADOOP-7211 URL: https://issues.apache.org/jira/browse/HADOOP-7211 Project: Hadoop Common Issue Type: Improvement Components: security Reporter: Eli Collins Assignee: Luke Lu The security code uses the KrbException, Credentials, and PrincipalName classes from sun.security.krb5 and Krb5Util from sun.security.jgss.krb5. These may disappear in future Java releases. Also Hadoop does not compile using jdks that do not support them, for example with the following IBM JDK. {noformat} $ /home/eli/toolchain/java-x86_64-60/bin/java -version java version 1.6.0 Java(TM) SE Runtime Environment (build pxa6460sr9fp1-20110208_03(SR9 FP1)) IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460sr9-20110203_74623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 GC - 20101027_AA) JCL - 20110203_01 {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-7211) Security uses proprietary Sun APIs
[ https://issues.apache.org/jira/browse/HADOOP-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248084#comment-13248084 ] Todd Lipcon commented on HADOOP-7211: - I don't think this JIRA should be marked as duplicate, because clearly HADOOP-6941 wasn't thoroughly tested. As soon as I tried running a cluster with a 2NN I found that it didn't work. So I'm skeptical that there isn't more work to do... Security uses proprietary Sun APIs -- Key: HADOOP-7211 URL: https://issues.apache.org/jira/browse/HADOOP-7211 Project: Hadoop Common Issue Type: Improvement Components: security Reporter: Eli Collins Assignee: Luke Lu The security code uses the KrbException, Credentials, and PrincipalName classes from sun.security.krb5 and Krb5Util from sun.security.jgss.krb5. These may disappear in future Java releases. Also Hadoop does not compile using jdks that do not support them, for example with the following IBM JDK. {noformat} $ /home/eli/toolchain/java-x86_64-60/bin/java -version java version 1.6.0 Java(TM) SE Runtime Environment (build pxa6460sr9fp1-20110208_03(SR9 FP1)) IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460sr9-20110203_74623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 GC - 20101027_AA) JCL - 20110203_01 {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8007) HA: use substitution token for fencing argument
[ https://issues.apache.org/jira/browse/HADOOP-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246113#comment-13246113 ] Todd Lipcon commented on HADOOP-8007: - bq. org.apache.hadoop.ha.TestZKFailoverController This failure was the JMXEnv issue tracked in HADOOP-8245. I will commit this momentarily HA: use substitution token for fencing argument --- Key: HADOOP-8007 URL: https://issues.apache.org/jira/browse/HADOOP-8007 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Todd Lipcon Attachments: hadoop-8007.txt, hadoop-8007.txt Per HADOOP-7983 currently the fencer always passes the target host:port to fence as the first argument to the fence script, it would be better to use a substitution token. That is to say, the user would configure myfence.sh $TARGETHOST foo bar and Hadoop would substitute the target. This would allow use of pre-existing scripts that might have a different ordering of arguments without a wrapper. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8215) Security support for ZK Failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245892#comment-13245892 ] Todd Lipcon commented on HADOOP-8215: - I'll commit this momentarily to the branch based on ATM's above +1, since the review feedback changes were mostly cosmetic. I ran the ZKFC and HAAdmin tests locally for both common and HDFS and they passed. Security support for ZK Failover controller --- Key: HADOOP-8215 URL: https://issues.apache.org/jira/browse/HADOOP-8215 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: Auto Failover (HDFS-3042) Attachments: hadoop-8215.txt, hadoop-8215.txt To keep the initial patches manageable, kerberos security is not currently supported in the ZKFC implementation. This JIRA is to support the following important pieces for security: - integrate with ZK authentication (kerberos or password-based) - allow the user to configure ACLs for the relevant znodes - add keytab configuration and login to the ZKFC daemons - ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8245) Fix flakiness in TestZKFailoverController
[ https://issues.apache.org/jira/browse/HADOOP-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245991#comment-13245991 ] Todd Lipcon commented on HADOOP-8245: - For problem #1, the solution is the same as is already done in some other test cases. We just need to add a workaround to clear the ZK MBeans before running the tearDown method. It's a hack, but in the absence of a fix for ZOOKEEPER-1438, it's about all we can do. I spent some time investigating problem #2. The bug is as follows: - these test cases create a new ActiveStandbyElector, and call {{ActiveStandbyElector.ensureBaseNode()}} on it before running the main body of the tests. Although they don't call {{joinElection()}}, the creation of the elector does create a {{zkClient}} object with an associated Watcher. - in the {{testZookeeperFailure}} test case, we shut down and restart ZK. This causes the above Watcher instance to fire its Disconnected and then Connected events. There was a bug in the handling of the Connected event that would cause it to re-monitor the lock znode regardless of whether it was previously in the election. - So, when ZK comes back up, there were not two but *three* electors racing for the lock. However, two of the electors actually corresponded to the same dummy service. In some cases this race would be resolved in such a way that the test timed out. I don't think this is a problem in practice, since the formatZK call runs in its own JVM in the current code. However, it's worth fixing to get the tests to not be flaky, and to have a more reasonable behavior. There are several fixes to be done: - Add extra asserts for {{wantToBeInElection}} to catch cases where we might accidentally re-join the election when we weren't supposed to be in it. - Fix the handling of the Connected event to only re-join if the elector wants to be in the election - Cause exceptions thrown by watcher callbacks to be propagated back as fatal errors Will post a patch momentarily. Fix flakiness in TestZKFailoverController - Key: HADOOP-8245 URL: https://issues.apache.org/jira/browse/HADOOP-8245 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor When I loop TestZKFailoverController, I occasionally see two types of failures: 1) the ZK JMXEnv issue (ZOOKEEPER-1438) 2) TestZKFailoverController.testZooKeeperFailure fails with a timeout This JIRA is for fixes for these issues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
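For reference, a minimal sketch of the "clear the ZK MBeans before tearDown" workaround described above; the JMX domain pattern and class name are assumptions, not necessarily what the committed patch uses:
{code}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class ZKMBeanCleanerSketch {
  // Unregister any MBeans the embedded ZooKeeper server left behind, so the
  // JMXEnv verification in the superclass tearDown() does not fail the test.
  static void clearZooKeeperMBeans() throws Exception {
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    for (ObjectName name :
        mbs.queryNames(new ObjectName("org.apache.ZooKeeperService:*"), null)) {
      mbs.unregisterMBean(name);
    }
  }
}
{code}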
[jira] [Commented] (HADOOP-8210) Common side of HDFS-3148
[ https://issues.apache.org/jira/browse/HADOOP-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244403#comment-13244403 ] Todd Lipcon commented on HADOOP-8210: - +1 Common side of HDFS-3148 Key: HADOOP-8210 URL: https://issues.apache.org/jira/browse/HADOOP-8210 Project: Hadoop Common Issue Type: Sub-task Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8210.txt, hadoop-8210.txt Common side of HDFS-3148, add necessary DNS and NetUtils methods. Test coverage is in the HDFS-3148 patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8243) Security support broken in CLI (manual) failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244625#comment-13244625 ] Todd Lipcon commented on HADOOP-8243: - I should note I also ran TestDFSHAAdmin and TestDFSHAAdminMiniCluster against this common patch, and they both passed. Security support broken in CLI (manual) failover controller --- Key: HADOOP-8243 URL: https://issues.apache.org/jira/browse/HADOOP-8243 Project: Hadoop Common Issue Type: Bug Components: ha, security Affects Versions: 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8243.txt Some recent refactoring accidentally caused the proxies in some places to get created with a default Configuration, instead of using the Configuration set up by the DFSHAAdmin tool. This causes the HAServiceProtocol to be missing the configuration which specifies the NN principle -- and thus breaks the CLI HAAdmin tool in secure setups. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8215) Security support for ZK Failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244654#comment-13244654 ] Todd Lipcon commented on HADOOP-8215: - I'm starting to work on this. Here's the plan: bq. integrate with ZK authentication (kerberos or password-based) Based on https://github.com/ekoontz/zookeeper/wiki and http://hbase.apache.org/configuration.html#zk.sasl.auth it looks like the SASL setup is a bit complicated, though entirely configuration based. I think for a first pass we should be OK to just use password-based authentication for ZK. I think this is sufficient because we have a well-defined set of clients that need to access these znodes, and they don't contain any content that needs to be encrypted over the wire. We can add SASL support later. bq. allow the user to configure ACLs for the relevant znodes This is reasonably straightforward - just needs some additional configuration keys to specify the ACL, and then tying it in to where we create the znodes. bq. add keytab configuration and login to the ZKFC daemons I think it should be OK to re-use the namenode principals here. That simplifies deployment since it avoids having to add new principals to the KDC, and given that the ZKFCs are intended to run on the same machines as the NNs, they will have access to the keytab files by default. Please speak up if you think we need separate keytabs/principals for the ZKFC daemons. bq. ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons This is just a matter of making sure we set up the target principal in the Configuration, and do the proper login/doAs before we start the main ZKFC code. Security support for ZK Failover controller --- Key: HADOOP-8215 URL: https://issues.apache.org/jira/browse/HADOOP-8215 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical To keep the initial patches manageable, kerberos security is not currently supported in the ZKFC implementation. This JIRA is to support the following important pieces for security: - integrate with ZK authentication (kerberos or password-based) - allow the user to configure ACLs for the relevant znodes - add keytab configuration and login to the ZKFC daemons - ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
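A rough sketch of the digest-based ZK authentication and znode ACL setup outlined in the first two bullets above (the user:password value and znode path are placeholders, and this is not the committed ZKFC code):
{code}
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;
import org.apache.zookeeper.server.auth.DigestAuthenticationProvider;

class ZKDigestAclSketch {
  static void createProtectedNode(ZooKeeper zk) throws Exception {
    // Authenticate this session with the configured user:password pair.
    zk.addAuthInfo("digest", "foo:testing".getBytes());
    // Build an ACL that only the same digest identity can read/write/administer.
    String digest = DigestAuthenticationProvider.generateDigest("foo:testing");
    List<ACL> acls = Collections.singletonList(
        new ACL(ZooDefs.Perms.ALL, new Id("digest", digest)));
    zk.create("/hadoop-ha", new byte[0], acls, CreateMode.PERSISTENT);
  }
}
{code}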
[jira] [Commented] (HADOOP-8215) Security support for ZK Failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244875#comment-13244875 ] Todd Lipcon commented on HADOOP-8215: - Because coverage of security is hard to automate, I performed the following manual test steps to verify this patch on a secure cluster: - Set up two NNs with kerberos security enabled - Use ZK command line to generate digest credentials: {code} todd@todd-w510:~/releases/zookeeper-3.4.1-cdh4b1$ java -cp lib/*:zookeeper-3.4.1-cdh4b1.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider foo:testing foo:testing->foo:vlUvLnd8MlacsE80rDuu6ONESbM= {code} Add these two to the HDFS configuration: {code} <property> <name>ha.zookeeper.acl</name> <value>digest:foo:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda</value> </property> <property> <name>ha.zookeeper.auth</name> <value>digest:foo:testing</value> </property> {code} - Run bin/hdfs zkfc -formatZK - Run bin/hdfs zkfc for each NN - Run bin/hdfs namenode for each NN - Verify that one of the NNs becomes active. Kill that NN. Verify that the other NN becomes active within a few seconds. - Verify authentication results in the NN logs: {code} 12/04/02 17:25:22 INFO authorize.ServiceAuthorizationManager: Authorization successfull for hdfs-todd/todd-w...@hadoop.com (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol {code} - Use ZK CLI to verify the acls: {code} [zk: localhost:2181(CONNECTED) 1] addauth digest foo:testing [zk: localhost:2181(CONNECTED) 2] ls /hadoop-ha [ActiveBreadCrumb, ActiveStandbyElectorLock] [zk: localhost:2181(CONNECTED) 3] getAcl /hadoop-ha 'digest,'foo:vlUvLnd8MlacsE80rDuu6ONESbM= : cdrwa [zk: localhost:2181(CONNECTED) 4] getAcl /hadoop-ha/ActiveBreadCrumb 'digest,'foo:vlUvLnd8MlacsE80rDuu6ONESbM= : cdrwa {code} - Shut down nodes, replace configuration with indirect version: {code} <property> <name>ha.zookeeper.acl</name> <value>@/home/todd/confs/devconf.ha.common/zk-acl.txt</value> </property> <property> <name>ha.zookeeper.auth</name> <value>@/home/todd/confs/devconf.ha.common/zk-auth.txt</value> </property> {code} and move the actual values to the files as specified above - Restart ZKFCs, verify that the ACLs are still being correctly used - chmod 000 the ACL data so it's no longer readable, try to restart one of the ZKFCs, verify error: {code} Exception in thread "main" java.io.FileNotFoundException: /home/todd/confs/devconf.ha.common/zk-acl.txt (Permission denied) {code} Security support for ZK Failover controller --- Key: HADOOP-8215 URL: https://issues.apache.org/jira/browse/HADOOP-8215 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8215.txt To keep the initial patches manageable, kerberos security is not currently supported in the ZKFC implementation. This JIRA is to support the following important pieces for security: - integrate with ZK authentication (kerberos or password-based) - allow the user to configure ACLs for the relevant znodes - add keytab configuration and login to the ZKFC daemons - ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8211) Update commons-net version to 3.1
[ https://issues.apache.org/jira/browse/HADOOP-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243587#comment-13243587 ] Todd Lipcon commented on HADOOP-8211: - +1, assuming you've done a full build locally and run ftpfs-related tests. (are there any such? I can't seem to find any, since HDFS-441 removed it from HDFS but HADOOP-6119 never re-committed it in Common) Update commons-net version to 3.1 - Key: HADOOP-8211 URL: https://issues.apache.org/jira/browse/HADOOP-8211 Project: Hadoop Common Issue Type: Sub-task Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8211.txt HADOOP-8210 requires the commons-net version be upgraded. Let's bump it to the latest stable version. The only other user is FtpFs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8210) Common side of HDFS-3148
[ https://issues.apache.org/jira/browse/HADOOP-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243597#comment-13243597 ] Todd Lipcon commented on HADOOP-8210: - {code} +LinkedHashSet<InetAddress> addrs = new LinkedHashSet<InetAddress>(); {code} I think it's worth changing the return type of this function to LinkedHashSet, so it's clear that the ordering here is on purpose. Perhaps also add a comment here saying something like: {code} // See below for reasoning behind using an ordered set. {code} {code} +// that depend on a particular element being 1st in the array. +// Eg. getDefaultIP always returns the 1st element. {code} Nits: please un-abbreviate "1st" to "first" for better readability. Also, "e.g." instead of "Eg." -- or just say "For example" {code} + ips[i] = addr.getHostAddress(); + i++; {code} I think it's more idiomatic to just put the postincrement inside the []s - there's a small spurious whitespace change in NetUtils.java - looks like the pom change is still in this patch (redundant with HADOOP-8211) Common side of HDFS-3148 Key: HADOOP-8210 URL: https://issues.apache.org/jira/browse/HADOOP-8210 Project: Hadoop Common Issue Type: Sub-task Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8210.txt Common side of HDFS-3148, add necessary DNS and NetUtils methods. Test coverage is in the HDFS-3148 patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
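A small sketch of the return-type suggestion above; the method name and signature are illustrative, not the actual DNS/NetUtils code under review:
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashSet;

class OrderedResolveSketch {
  // Returning LinkedHashSet (rather than plain Set) documents that insertion order
  // matters: callers such as a getDefaultIP()-style method treat the first element
  // specially.
  static LinkedHashSet<InetAddress> resolveOrdered(String... hosts)
      throws UnknownHostException {
    LinkedHashSet<InetAddress> addrs = new LinkedHashSet<InetAddress>();
    for (String host : hosts) {
      addrs.add(InetAddress.getByName(host));
    }
    return addrs;
  }
}
{code}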
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242664#comment-13242664 ] Todd Lipcon commented on HADOOP-8220: - bq. Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable? Currently, ActiveStandbyElector doesn't take a Configuration object. I think many of the parameters should be changed to be configured via Configuration, but I didn't want to make this into a bigger scoped change. bq. There's some inconsistency in capitalization between reJoinElection and rejoinElectionAfterFailureToBecomeActive Changed to consistently use reJoin to match the previously existing code. bq. Might want to do a s/System.currentTimeMillis/Util.now/g The {{Util}} class is in HDFS, but this code is in common. We don't seem to have an equivalent in common. bq. Any reason we shouldn't make LOG_INTERVAL_MS configurable? It's just test code, so seemed unnecessary. bq. Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static. bq. Considering the comment says after sleeping for a short period in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect. Done. I made the overridden method keep a tally of number of slept millis, and added asserts to the tests to make sure it slept for some time when rejoining. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
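A minimal sketch of the sleepFor() hook and the test override discussed above; the class names and bodies are illustrative rather than the actual patch:
{code}
import com.google.common.annotations.VisibleForTesting;

class ElectorSleepSketch {
  // Deliberately an instance method (not static): tests subclass and override it
  // so they can observe the sleep without actually waiting.
  @VisibleForTesting
  protected void sleepFor(int sleepMs) {
    try {
      Thread.sleep(sleepMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

class TestElectorSleepSketch extends ElectorSleepSketch {
  long sleptMs = 0;
  @Override
  protected void sleepFor(int sleepMs) {
    sleptMs += sleepMs;  // tally instead of sleeping; assert sleptMs > 0 after a rejoin
  }
}
{code}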
[jira] [Commented] (HADOOP-8228) Auto HA: Refactor tests and add stress tests
[ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242677#comment-13242677 ] Todd Lipcon commented on HADOOP-8228: - bq. One question: are you positive that the ordering of the two @After methods either doesn't matter, or is guaranteed to happen in the right order? The order of the two @After methods is nondeterministic. But, in this case, it's only important that our @After method runs before the superclass (ClientBase)'s tearDown. JUnit does guarantee the ordering in this case. bq. One comment: maybe use a deterministic random seed for the Random instances you're using? Or at least log the amount of time that the test is sleeping for and what it's throwing? Good point. I added additional logging for when it throws exceptions, and for when it expires sessions. I don't think the deterministic seed helps things, since the interleaving is still non-deterministic (that's part of the value of these tests :) ) Auto HA: Refactor tests and add stress tests Key: HADOOP-8228 URL: https://issues.apache.org/jira/browse/HADOOP-8228 Project: Hadoop Common Issue Type: Test Components: auto-failover, ha, test Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8228.txt, hadoop-8228.txt, hadoop-8228.txt It's important that the ZKFailoverController be robust and not contain race conditions, etc. One strategy to find potential races is to add stress tests which exercise the code as fast as possible. This JIRA is to implement some test cases of this style. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
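For reference, a minimal illustration of the @After ordering relied on above (class and method names are made up, not the actual test code):
{code}
import org.junit.After;

class BaseZkTestSketch {
  @After
  public void tearDown() throws Exception {
    // superclass cleanup (e.g. ClientBase's ZK server / JMXEnv checks) runs LAST
  }
}

class StressTestSketch extends BaseZkTestSketch {
  @After
  public void stopElectors() throws Exception {
    // subclass cleanup runs FIRST -- JUnit 4 invokes subclass @After methods
    // before superclass @After methods, which is the ordering relied on above
  }
}
{code}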
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242855#comment-13242855 ] Todd Lipcon commented on HADOOP-8217: - bq. 3. ZKFC2 tries to do transitionToStandby() on NN1. RPC times out. bq. 4. Don't know what happens now in your design As has been the case in all of the HA work up to and including this point, it initiates the fence method at this point. The fence method has to do persistent fencing of the shared resource (eg. disable access to the SAN or STONITH the node). Please refer to the code in which I think this is fairly clear. The solution here is to improve the ability to do failover when graceful fencing suffices. In many failover cases it's preferable to _not_ have to invoke STONITH or storage fencing, since those mechanisms will often require administrative intervention to un-fence. bq. Given, the above, how will NN1 receive the zxid from ZKFC2? If it does not then the solution is invalid. Hari's scenario exemplifies this. All transitionToActive/transitionToStandby calls would include the zxid. So, the sequence becomes: 1. ZKFC1 gets active lock (zxid=1) 2. ZKFC1 is about to send transitionToActive(1) and machine freezes (eg GC pause + swapping) 3. ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock (zxid=2) 4. ZKFC2 calls NN1.transitionToStandby(2) and NN2.transitionToActive(2). 5. ZKFC1 wakes up from pause, calls NN1.transitionToActive(1). NN1 rejects the request because it previously accepted zxid=2 in step 4 above. or the failure case: 4(failure case): if NN1.transitionToStandby() times out or fails, the non-graceful fencing is initiated (same as in existing HA code for the last several months) 5(failure case with storage fencing): ZKFC1 wakes up from pause, and calls NN1.transitionToActive(1). NN1 tries to access the shared edits storage and fails, because it has been fenced. So, there is no split-brain. 5(failure case with STONITH): ZKFC1 never wakes up from pause, because its power plug has been pulled. So, there is no split-brain. Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8217-testcase.txt As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
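A hedged sketch of the "reject older requests" rule in the scenario above; the field and method shown are illustrative and not the actual HAServiceProtocol signatures:
{code}
import java.io.IOException;

class ZxidFencingSketch {
  private long lastAcceptedZxid = Long.MIN_VALUE;

  // Each transition request carries the zxid under which the caller won the lock.
  // A request older than one we already accepted is rejected, so a ZKFC waking up
  // from a long pause cannot re-activate the node.
  synchronized void transitionToActive(long requestZxid) throws IOException {
    if (requestZxid < lastAcceptedZxid) {
      throw new IOException("Stale failover request: zxid " + requestZxid
          + " is older than already-accepted zxid " + lastAcceptedZxid);
    }
    lastAcceptedZxid = requestZxid;
    // ... actually transition to active here ...
  }
}
{code}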
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242880#comment-13242880 ] Todd Lipcon commented on HADOOP-8217: - bq. Can you please point me to the existing HA code for last several months? I thought we have manual HA in which admin does fencing. See HDFS-2179 (committed last August), which added the fencing code, and HADOOP-7938, which added the fencing behavior to the manual failover controller (committed in January). The HA guide ({{hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm}}) also details the configuration and operation of the fencing: {quote} * failover - initiate a failover between two NameNodes This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned. {quote} Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8217-testcase.txt As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242895#comment-13242895 ] Todd Lipcon commented on HADOOP-8202: - This patch also introduced the following bug: if the proxy.close() function throws an IOException, then HadoopIllegalArgumentException will be thrown, claiming that the proxy doesn't implement Closeable. This is the wrong error to throw, and is a regression in behavior (failure to close due to IOE should just be a warning, as it was previously). Hari, would you mind fixing this? stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Fix For: 2.0.0 Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202-4.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
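A sketch of the close handling being asked for above (illustrative, not the committed RPC.stopProxy() body; LOG here is just a commons-logging logger for the sketch class):
{code}
import java.io.Closeable;
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.HadoopIllegalArgumentException;

class StopProxySketch {
  private static final Log LOG = LogFactory.getLog(StopProxySketch.class);

  static void stopProxy(Object proxy) {
    if (proxy instanceof Closeable) {
      try {
        ((Closeable) proxy).close();
      } catch (IOException e) {
        // A failure to close is only worth a warning -- it must not be reported
        // as "proxy is not Closeable".
        LOG.warn("Exception closing proxy " + proxy, e);
      }
    } else {
      throw new HadoopIllegalArgumentException(
          "Cannot close proxy - is not Closeable: " + proxy);
    }
  }
}
{code}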
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242944#comment-13242944 ] Todd Lipcon commented on HADOOP-8217: - bq. I would like to question the value of FC2 calling NN1.transitionToStandby() in general. FC1 on NN1 is supposed to call NN1.transitionToStandby() because thats is FC1's responsibility upon losing the leader lock. This doesn't work, since FC1 can take arbitrarily long to notice that it has lost its lock. bq. Secondly, based on the recent work done to add breadcrumbs to the ActiveStandbyElector, FC2 is going to fence NN1 if NN1 has not gracefully given up the lock, which is clearly the case here. So the problem is already solved unless I am mistaken. But the first stage of fencing is to gracefully ask the NN to go to standby. This is exactly the problem here. If, instead, we always required that we always use an aggressive fencing mechanism (STONITH/NAS fencing), you're right that there would not be a problem. But we can avoid that in many cases -- for example, imagine that the active node loses its connection to the ZK quorum, but still has a connection to the other NN (eg by a crossover cable). In this case it will leave its breadcrumb znode there, but the new active can easily transition it to standby. Here's another way of looking at this JIRA: - the aggressive fencing mechanisms have the property of being persistent. i.e after fencing, the node cannot become active, even if asked to. - the graceful fencing mechanism (transitionToStandby() RPC) does not currently have the property of being persistent. If another older node asks it to become active after it's been gracefully fenced, it will do so incorrectly. - This JIRA makes graceful fencing persistent, so it can be used correctly. Regarding the ActiveStandbyElector callback for {{becomeStandby}}, I actually think it's redundant. There are two cases in which it could be called: - If already standby, it's a no-op - If active, then this indicates that the elector lost its znode. Since it lost its znode (rather than quitting the election gracefully), it will leave its breadcrumb behind. Thus, the other node will fence it. So, calling transitionToStandby is redundant with fencing which the other node will have to perform anyway. Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8217-testcase.txt As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8212) Improve ActiveStandbyElector's behavior when session expires
[ https://issues.apache.org/jira/browse/HADOOP-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241832#comment-13241832 ] Todd Lipcon commented on HADOOP-8212: - Thanks for reviewing the addendum, and for your comments. I'll commit the addendum to the new HDFS-3042 branch momentarily. Improve ActiveStandbyElector's behavior when session expires Key: HADOOP-8212 URL: https://issues.apache.org/jira/browse/HADOOP-8212 Project: Hadoop Common Issue Type: Improvement Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Auto Failover (HDFS-3042) Attachments: hadoop-8212-delta-bikas.txt, hadoop-8212.txt, hadoop-8212.txt Currently when the ZK session expires, it results in a fatal error being sent to the application callback. This is not the best behavior -- for example, in the case of HA, if ZK goes down, we would like the current state to be maintained, rather than causing either NN to abort. When the ZK clients are able to reconnect, they should sort out the correct leader based on the normal locking schemes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241935#comment-13241935 ] Todd Lipcon commented on HADOOP-8220: - Actually moving the error handling code to the call site (instead of inside becomeActive()) introduced a bug, since we call becomeActive() from another spot as well, in the StatCallback. So we need to have similar code there, or move the error handling back up into becomeActive() ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240237#comment-13240237 ] Todd Lipcon commented on HADOOP-8217: - Suresh: we've already had a meeting ostensibly for this purpose, I think. There is also a design document posted to HDFS-2185. The document doesn't include every possible scenario, because I don't have infinite foresight. I don't think having meetings or more reviews of the design doc will help that. For example, with the original manual-failover project, we had several design meetings as well as a design document posted on HDFS-1623. Looking back at that project, the design document captured the overall idea (like the HDFS-2185 one does here) but did not foresee some of the trickiest issues we dealt with during implementation (for example, how to deal with invalidations with regard to datanode fencing, how to handle safe mode, how to deal with delegation tokens, etc). In that project, as we came upon each new scenario to deal with, we opened a JIRA and had a discussion on the design solution for that particular scenario. I don't see why we can't do the same here. Nor do I see why we are likely to be able to foresee all the corner cases a priori here better than we were able to with HDFS-1623. So, I am not going to pause work to wait for meetings or more design discussion. If you see problems with the design, please comment on the design doc on HDFS-2185, or on the individual JIRAs which seem to have problems. I'm happy to address them, even after commit (eg I'm currently addressing Bikas's review comments on HADOOP-8212) Since there seems to be concern that we are moving too fast, I will create an auto-failover branch later tonight to continue working on implementing this design. I'll also create a new auto-failover component on JIRA so it's easier to follow them. If you have concerns about the implementation or the design when it comes time to merge it, please do vote against the merge, voicing whatever objections you might have. And please do comment along the way if you see issues. Thanks. Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240776#comment-13240776 ] Todd Lipcon commented on HADOOP-8220: - Yep, your updated description of the tight loop is exactly right. Sorry, I didn't note the fact that becomeActive() throws an exception in this scenario. New draft of the patch attached. - Added a true unit test for the new changes, in addition to the functional test from the prior revision (TestActiveStandbyElector#testFailToBecomeActive) - Change the control flow so that the success and error cases are kept near each other (suggested by Bikas above) - Changed the sleep calls to be wrapped in a {{sleepFor(ms)}} function, so it's easy to disable the sleeping behavior in the unit tests. Otherwise the tests ran longer for no good reason. In response to a couple comments above that got lost in the discussion: {quote} 2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock. {quote} This is really difficult to do reliably, since there's no good way to 'cancel' the callback. The {{transitionToActive}} RPC itself should have a timeout attached -- it's much more straightforward to do that than to try to make ActiveStandbyElector guard against arbitrary code running too long in the callback. I added a note to the javadoc indicating this. {quote} Do you really want to commit the logs added to ActiveStandbyTestUtil? {quote} Yes, I found that when I had a test failure due to timeout, it was difficult to debug, since I couldn't easily tell which node had the lock at the time the test timed out. I rate-limited the logging to only two per second, so it shouldn't make the logs too noisy, while retaining the advantage of seeing what's going on better if there is a timeout. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt, hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8212) Improve ActiveStandbyElector's behavior when session expires
[ https://issues.apache.org/jira/browse/HADOOP-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239631#comment-13239631 ] Todd Lipcon commented on HADOOP-8212: - bq. I think we want to added similar handling in the StatCallback. Its another race waiting to happen. The patch does add the same handling to StatCallback. It uses the ZooKeeper context parameter to pass the original zkClient. Unfortunately the Watcher interface doesn't have any context object, which is why I had to introduce the wrapper class there. bq. The comment on processWatchEvent needs to change slightly to reflect that its the proxied watcher callback handler. Does the following look good? {code} - * interface implementation of Zookeeper watch events (connection and node) + * interface implementation of Zookeeper watch events (connection and node), + * proxied by {@link WatcherWithClientRef}. {code} bq. Whats the hurry? In my experience working on similar projects in the past, getting all the initial code in place is only half the battle. The real work starts once the code is there and you start banging on it in realistic test scenarios. We'd like to see automatic failover be a supported piece of the HA solution in 0.23.x (..err..2.0), and to hit that timeline, we need to get into the latter phase ASAP. I'm less aggressive when it comes to changing existing code, but since this is all new code, there's no risk of regressing working features by moving fast here. Once it starts to stabilize we can afford to slow down the rate of change. If you'd prefer, I'm happy to create a feature branch for auto-failover and then call a merge vote when it's ready for the full QA onslaught. Improve ActiveStandbyElector's behavior when session expires Key: HADOOP-8212 URL: https://issues.apache.org/jira/browse/HADOOP-8212 Project: Hadoop Common Issue Type: Improvement Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.23.3, 0.24.0 Attachments: hadoop-8212.txt, hadoop-8212.txt Currently when the ZK session expires, it results in a fatal error being sent to the application callback. This is not the best behavior -- for example, in the case of HA, if ZK goes down, we would like the current state to be maintained, rather than causing either NN to abort. When the ZK clients are able to reconnect, they should sort out the correct leader based on the normal locking schemes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
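A rough sketch of the WatcherWithClientRef wrapper referenced in the diff above; the handler signature is illustrative, not the exact code in the patch:
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class WatcherWithClientRefSketch implements Watcher {
  // ZooKeeper's Watcher callback carries no context argument, so the wrapper captures
  // the client it was registered for and passes it along, letting the elector ignore
  // events that arrive from a stale (already-replaced) client.
  private final ZooKeeper clientRef;

  WatcherWithClientRefSketch(ZooKeeper clientRef) {
    this.clientRef = clientRef;
  }

  @Override
  public void process(WatchedEvent event) {
    processWatchEvent(clientRef, event);
  }

  private void processWatchEvent(ZooKeeper source, WatchedEvent event) {
    // the real handler would compare 'source' against the current zkClient
  }
}
{code}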
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239641#comment-13239641 ] Todd Lipcon commented on HADOOP-8220: - I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via the integration test, but you're right that adding to the unit tests makes sense too. bq. How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about? The NPE occurred in the previous code because we had the following sequence: - createNode succeeded - called ZKFC becomeActive() callback -- becomeActive() throws exception -- ZKFC had a catch() clause which called quitElection () (it turned out this wasn't the right behavior) --- quitElection() nulled out zkClient - ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been nulled out. The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics. bq. What is the purpose of adding the sleep? Could you please elaborate? Without the sleep, it will do a tight loop retrying to become active. This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active. I will add a comment to this effect in the code. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239685#comment-13239685 ] Todd Lipcon commented on HADOOP-8218: - I'm fine with that, too. Suresh/Tom? Pick your patch, I'll do it. I just want to get something committed today to fix the failing tests. RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8218.txt, hadoop-8218.txt HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239957#comment-13239957 ] Todd Lipcon commented on HADOOP-8220: - bq. Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions? You got it. Exceptions are generally preferred for cases like this -- since we have to handle the error condition regardless of whether it's a usual error or whether it was something like a NPE or other truly exceptional condition. So even with a boolean return type, we'd need a try/catch clause. Does that make sense? (I also had originally made it return boolean but then changed it to an exception) bq. I did not understand where the tight loop is? Do you mean (Elector gets lock-ZKFC fails to becomes active)? Yep. In my test I saw that the standby would retry in a tight loop like that: # Succeed in getting lock # Call becomeActive() # drop ZK session (lock disappears) # reconnect to ZK # Goto 1 I simply inserted a sleep between dropping the connection and reconnecting. This gives the old active a better chance to become active again (or if there is a third node in the future, it would have a chance to take the lock). In the future we may want to add some random jitter and exponential backoff, but at this point let's keep it simple. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240162#comment-13240162 ] Todd Lipcon commented on HADOOP-8218: - Since the patch is up, and people seem OK with it, I'll commit the version Tom suggested (the latter patch) RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8218.txt, hadoop-8218.txt HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238125#comment-13238125 ] Todd Lipcon commented on HADOOP-8202: - Instead of adding the instanceof check anywhere we use an object that might be a mock, can we instead change the protocol interfaces themselves to extend Closeable? That will make sure that any proxy implementations themselves take care of extending it, and also will solve the mock issue (since the mock itself will then also extend Closeable). stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
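To illustrate the suggestion above, a minimal sketch; FooProtocol is a made-up stand-in for a real IPC protocol interface:
{code}
import java.io.Closeable;
import java.io.IOException;

// FooProtocol is hypothetical. Once the protocol interface itself extends Closeable,
// both real RPC proxies and Mockito mocks of it can be closed by RPC.stopProxy()
// without instanceof checks at the call sites.
public interface FooProtocol extends Closeable {
  long getProtocolVersion(String protocol, long clientVersion) throws IOException;
}
{code}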
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238597#comment-13238597 ] Todd Lipcon commented on HADOOP-8202: - Sure, go ahead and commit. Thanks. stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8131) FsShell put doesn't correctly handle a non-existent dir
[ https://issues.apache.org/jira/browse/HADOOP-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238646#comment-13238646 ] Todd Lipcon commented on HADOOP-8131: - If it's not too huge a pain, I'd be in favor of a deprecated config flag which restores the old behavior (while emitting a warning that it's deprecated and to be removed in a future version). This will help people migrate to 0.23, since I'm sure there are lots of cases where people have shell scripts running as part of production workflows. FsShell put doesn't correctly handle a non-existent dir --- Key: HADOOP-8131 URL: https://issues.apache.org/jira/browse/HADOOP-8131 Project: Hadoop Common Issue Type: Bug Affects Versions: 0.23.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Fix For: 0.23.2 Attachments: HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch {noformat} $ hadoop fs -ls ls: `.': No such file or directory $ hadoop fs -put file $ hadoop fs -ls Found 1 items -rw-r--r-- 1 kihwal supergroup 2076 2011-11-04 10:37 .._COPYING_ {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8131) FsShell put doesn't correctly handle a non-existent dir
[ https://issues.apache.org/jira/browse/HADOOP-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238677#comment-13238677 ] Todd Lipcon commented on HADOOP-8131: - bq. Just to confirm: you mean a real conf key, not a cmdline flag, right? Yep - something that could be set system-wide in core-site.xml. When users upgrade, they expect they may have to tweak some confs for the new version, but it's harder to ask them to change all of their shell scripts. bq. In either case it will be a change now or change later scenario Right. The idea is that they would have some warning (a full major version) before their code stops working. Our general policy is to only make the breaking change after having the deprecated support for a full major version -- in which case it would go away in 0.24.0. bq. Would this bring back the issue of left out _temporary dirs? (MAPREDUCE-1272) I would think the MR task would be using the new non-deprecated API which doesn't recursively create parents. FsShell put doesn't correctly handle a non-existent dir --- Key: HADOOP-8131 URL: https://issues.apache.org/jira/browse/HADOOP-8131 Project: Hadoop Common Issue Type: Bug Affects Versions: 0.23.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Fix For: 0.23.2 Attachments: HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch {noformat} $ hadoop fs -ls ls: `.': No such file or directory $ hadoop fs -put file $ hadoop fs -ls Found 1 items -rw-r--r-- 1 kihwal supergroup 2076 2011-11-04 10:37 .._COPYING_ {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
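A sketch of what such a compatibility flag might look like; the key name below is invented for illustration and is not an actual Hadoop property:
{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical key name, for illustration only.
public class LegacyPutBehavior {
  static final String LEGACY_KEY = "fs.shell.put.create.missing.parents";

  // Returns true if the old recursive-parent-creation behavior should be used,
  // warning that the escape hatch is deprecated.
  static boolean useLegacyBehavior(Configuration conf) {
    boolean legacy = conf.getBoolean(LEGACY_KEY, false);
    if (legacy) {
      System.err.println("WARN: " + LEGACY_KEY
          + " is deprecated and will be removed in a future major release");
    }
    return legacy;
  }
}
{code}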
[jira] [Commented] (HADOOP-8212) Improve ActiveStandbyElector's behavior when session expires
[ https://issues.apache.org/jira/browse/HADOOP-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238748#comment-13238748 ] Todd Lipcon commented on HADOOP-8212: - Sure, happy to address post-commit. Sorry for moving quick - trying to get at least an initial implementation of auto failover committed quickly, and we can continue to improve and fix it up. Improve ActiveStandbyElector's behavior when session expires Key: HADOOP-8212 URL: https://issues.apache.org/jira/browse/HADOOP-8212 Project: Hadoop Common Issue Type: Improvement Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.23.3, 0.24.0 Attachments: hadoop-8212.txt, hadoop-8212.txt Currently when the ZK session expires, it results in a fatal error being sent to the application callback. This is not the best behavior -- for example, in the case of HA, if ZK goes down, we would like the current state to be maintained, rather than causing either NN to abort. When the ZK clients are able to reconnect, they should sort out the correct leader based on the normal locking schemes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238915#comment-13238915 ] Todd Lipcon commented on HADOOP-8217: - My thinking for the solution is the following: - add a parameter to transitionToStandby/transitionToActive which is a {{long logicalTime}} - when the ZKFC acquires the lock znode, it makes a note of the zxid (ZK transaction ID) - when it then asks the old active to go to standby, or asks its own node to go active, it includes the zxid - the NN itself maintains a record of the highest zxid it has heard. If it receives a state transition request with an older zxid, it ignores it. This would solve the race as described, since when ZKFC2 calls NN1.transitionToStandby(), it hands NN1 a higher zxid than ZKFC1 saw. So when ZKFC1 then asks it to go active, the request is denied. There is still potentially some race involving the NNs restarting quickly and forgetting the highest zxid. I'm not sure whether the right solution there is to record the info persistently, or to attach a UUID to each NN startup, and use that to make sure we don't target a newer instance of a NN with an RPC that was meant for an earlier one. Other creative solutions appreciated :) Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
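A rough sketch of the receiving side of the zxid idea above; the class and method names are assumptions, not the actual HAServiceProtocol changes:
{code}
import java.io.IOException;

// Sketch only: the NN-side guard that ignores state-transition requests carrying
// an older logical time (zxid) than the newest one already seen.
public class LogicalTimeGuard {
  private long highestSeenLogicalTime = Long.MIN_VALUE;

  public synchronized void checkRequest(long logicalTime) throws IOException {
    if (logicalTime < highestSeenLogicalTime) {
      throw new IOException("Rejecting state transition request: logical time "
          + logicalTime + " is older than " + highestSeenLogicalTime);
    }
    highestSeenLogicalTime = logicalTime;
  }
}
{code}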
[jira] [Commented] (HADOOP-8206) Common portion of ZK-based failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238982#comment-13238982 ] Todd Lipcon commented on HADOOP-8206: - bq. Makes sense to me. One question, though - we seem to be inconsistently using IllegalArgumentException and HadoopIllegalArgumentException. Is there any good reason for that? I'm not entirely sure -- looking across the code as a whole, we have a 10:1 ratio of IllegalArgumentException vs HadoopIllegalArgumentException. So I'm erring on the side of what's used more often, except in a few places where we directly expose it as a potentially user-visible error (like bad command line arguments). Common portion of ZK-based failover controller -- Key: HADOOP-8206 URL: https://issues.apache.org/jira/browse/HADOOP-8206 Project: Hadoop Common Issue Type: New Feature Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8206.txt, hadoop-8206.txt, hadoop-8206.txt This JIRA is for the Common (generic) portion of HDFS-2185. It can't run on its own, but this JIRA will include unit tests using mock/dummy services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239120#comment-13239120 ] Todd Lipcon commented on HADOOP-8202: - This broke TestZKFailoverController, since it's now getting IllegalArgumentException trying to close the proxy. stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202-4.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239121#comment-13239121 ] Todd Lipcon commented on HADOOP-8218: - I see three options: 1) Anywhere we call RPC.closeProxy, we check if (foo instanceof Closeable) { ... } first. But, that defeats the whole purpose of throwing the exception when we pass non-closeables, so we might as well just revert the behavior back to the original rather than do this. 2) In RPC.closeProxy, if the object doesn't implement Closeable, check if the proxy is a mock object. We can do this by looking for the string EnhancerByMockitoWithCGLIB in the class name. If we see that, pass through. 3) Anywhere we mock out an IPC protocol, we could use the syntax {{mock(FooProtocol.class, withSettings().extraInterfaces(Closeable.class));}}. I am not a fan of this, since it leaks the issue out to all of the test code, rather than localizing the workaround in the one place that matters. Plus, newer users of the mock framework won't know this advanced usage syntax (I had to google for a while to figure it out). So, I plan to implement #2. RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239171#comment-13239171 ] Todd Lipcon commented on HADOOP-8218: - bq. Todd, can the mock object implement a new interface that extends HAServiceProtocol and Closeable? Will that solve the problem? That avoids the advanced syntax, but requires that you make such a fake interface everywhere you mock a protocol, which again is somewhat counter-intuitive. bq. #2 makes the main code aware of test specifics, which isn't a good idea. How about doing #3 by creating a helper method that encapsulates that code in one place? I was thinking about doing that... ie a MockitoUtils.mockIpcProtocol(FooProtocol.class). Since it seems people like this idea better than #2, I'll prepare such a patch. RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8218.txt HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
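A sketch of the helper being proposed; the class and method names follow the comment's suggestion and may differ from what was ultimately committed:
{code}
import java.io.Closeable;
import org.mockito.Mockito;

public abstract class MockitoUtil {
  /**
   * Create a mock of an IPC protocol that also implements Closeable,
   * so RPC.stopProxy() can close it like a real proxy.
   */
  public static <T> T mockIpcProtocol(Class<T> clazz) {
    return Mockito.mock(clazz,
        Mockito.withSettings().extraInterfaces(Closeable.class));
  }
}
{code}
This keeps the extraInterfaces trick in one place, so test writers just call the helper instead of remembering the advanced Mockito syntax.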
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239184#comment-13239184 ] Todd Lipcon commented on HADOOP-8220: - Tests failing due to HADOOP-8218 ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8207) createproxy() in TestHealthMonitor is throwing NPE
[ https://issues.apache.org/jira/browse/HADOOP-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237788#comment-13237788 ] Todd Lipcon commented on HADOOP-8207: - I think this is dup of HADOOP-8204 createproxy() in TestHealthMonitor is throwing NPE -- Key: HADOOP-8207 URL: https://issues.apache.org/jira/browse/HADOOP-8207 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.24.0 Reporter: Hari Mankude Priority: Minor Looking at the test log output, createproxy in testhealthmonitor is triggering NPE resulting null proxy. This creates other test failures. 2012-03-24 22:16:11,591 FATAL ha.HealthMonitor (HealthMonitor.java:uncaughtException(268)) - Health monitor failed java.lang.NullPointerException at org.apache.hadoop.ha.TestHealthMonitor$1.createProxy(TestHealthMonitor.java:75) at org.apache.hadoop.ha.HealthMonitor.tryConnect(HealthMonitor.java:171) at org.apache.hadoop.ha.HealthMonitor.loopUntilConnected(HealthMonitor.java:158) at org.apache.hadoop.ha.HealthMonitor.access$500(HealthMonitor.java:52) at org.apache.hadoop.ha.HealthMonitor$MonitorDaemon.run(HealthMonitor.java:278) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8208) Disallow self failover
[ https://issues.apache.org/jira/browse/HADOOP-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237979#comment-13237979 ] Todd Lipcon commented on HADOOP-8208: - Looks good. I just reverted HADOOP-8193 since it caused some other test failures, but when it is recommitted we can commit this. +1 in advance Disallow self failover -- Key: HADOOP-8208 URL: https://issues.apache.org/jira/browse/HADOOP-8208 Project: Hadoop Common Issue Type: Bug Components: ha Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8208.txt, hdfs-3145.txt It is currently possible for users to make a standby NameNode failover to itself and become active. We shouldn't allow this to happen in case operators mistype and miss the fact that there are now two active NNs. {noformat} bash-4.1$ hdfs haadmin -ns ha-nn-uri -failover nn2 nn2 Failover from nn2 to nn2 successful {noformat} After the failover above, nn2 will be active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237720#comment-13237720 ] Todd Lipcon commented on HADOOP-8202: - I think maintaining the ability to support mockito spies/mocks for proxies is important. We use it for simulating all kinds of failure conditions -- I'm surprised you didn't have a lot of HDFS failures from the same issue. stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236934#comment-13236934 ] Todd Lipcon commented on HADOOP-8163: - Hi Bikas. I think your ideas have some merit, especially with regard to a fully general election framework. But since we only have one user of this framework at this point (HDFS) and we currently only support a single standby node, I would prefer to punt these changes to another JIRA as additional improvements. This will let us move forward with the high priority task of auto failover for HA NNs, rather than getting distracted making this extremely general. bq. Secondly, we are performing blocking calls on the ZKClient callback that happens on the ZK threads. It is advisable to not block ZK client threads for long. This is only the case if you have other operations that are waiting on timely delivery of callbacks. In the case of the election framework, all of our notifications from ZK have to be received in-order and processed sequentially, or else we have a huge explosion of possible interactions to worry about. Doing blocking calls in the callbacks will _not_ result in lost ZK leases, etc. To quote from the ZK programmer's guide: {quote}All IO happens on the IO thread (using Java NIO). All event callbacks happen on the event thread. Session maintenance such as reconnecting to ZooKeeper servers and maintaining heartbeat is done on the IO thread. Responses for synchronous methods are also processed in the IO thread. All responses to asynchronous methods and watch events are processed on the event thread... Callbacks do not block the processing of the IO thread or the processing of the synchronous calls{quote} bq. Thirdly, how about using the setData(breadcrumb, appData, version)? Let me see about making this change. Like you said, it's a good safety check. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8060) Add a capability to use of consistent checksums for append and copy
[ https://issues.apache.org/jira/browse/HADOOP-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236965#comment-13236965 ] Todd Lipcon commented on HADOOP-8060: - Doing shallow conf comparison as part of the FS key seems a bit dangerous -- I'm guessing we'll end up with a lot of leakage issues in long running daemons like the NM/RM. Anyone else have some other ideas how to deal with this? I don't think the CreateFlag idea is bad -- maybe better than futzing with the cache. Add a capability to use of consistent checksums for append and copy --- Key: HADOOP-8060 URL: https://issues.apache.org/jira/browse/HADOOP-8060 Project: Hadoop Common Issue Type: Bug Components: fs, util Affects Versions: 0.23.0, 0.23.1, 0.24.0 Reporter: Kihwal Lee Assignee: Kihwal Lee Fix For: 0.23.2, 0.24.0 After the improved CRC32C checksum feature became default, some of use cases involving data movement are no longer supported. For example, when running DistCp to copy from a file stored with the CRC32 checksum to a new cluster with the CRC32C set to default checksum, the final data integrity check fails because of mismatch in checksums. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236970#comment-13236970 ] Todd Lipcon commented on HADOOP-8163: - bq. In my experience API's once made are hard to change. It would be hard for someone to change the control flow later once important services like NN HA depend on the current flow. So punting it for the future would be quite a distant future indeed Given this is an internal API, there shouldn't be any resistance to changing it in the future. It's marked Private/Evolving, meaning that there aren't guarantees of compatibility to external consumers, and that even for internal consumers it's likely to change as use cases evolve. I'll file a follow-up JIRA to consider your recommended API changes, OK? Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8193) Refactor FailoverController/HAAdmin code to add an abstract class for target services
[ https://issues.apache.org/jira/browse/HADOOP-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236984#comment-13236984 ] Todd Lipcon commented on HADOOP-8193: - Also ran findbugs on common and HDFS, there were no additional warnings. Refactor FailoverController/HAAdmin code to add an abstract class for target services --- Key: HADOOP-8193 URL: https://issues.apache.org/jira/browse/HADOOP-8193 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8193.txt, hadoop-8193.txt In working at HADOOP-8077, HDFS-3084, and HDFS-3072, I ran into various difficulties which are an artifact of the current design. A few of these: - the service name is resolved from the logical name (eg ns1.nn1) to an IP address at the outer layer of DFSHAAdmin -- this means it's difficult to provide the logical name ns1.nn1 to fence scripts (HDFS-3084) -- this means it's difficult to configure fencing method per-namespace (since the FailoverController doesn't know what the namespace is) (HADOOP-8077) - the configuration for HA HDFS is weirdly split between core-site and hdfs-site, even though most users see this as an HDFS feature. For example, users expect to configure NN fencing configurations in hdfs-site, and expect the keys to have a dfs.* prefix - proxies are constructed at the outer layer of the admin commands. This means it's impossible for the inner layers (eg FailoverController.failover) to re-construct proxies with different timeouts (HDFS-3072) The proposed refactor is to add a new interface (tentatively named HAServiceTarget) which refers to target for one of the admin commands. An instance of this class is responsible for creating proxies, creating fencers, mapping back to a logical name, etc. The HDFS implementation of this class can then provide different results based on the particular nameservice, can use HDFS-specific configuration prefixes, etc. Using this class as the argument for fencing methods also makes the API more evolvable in the future, since we can add new getters to HAServiceTarget (whereas the current InetSocketAddress is quite limiting) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235937#comment-13235937 ] Todd Lipcon commented on HADOOP-8163: - Hi Bikas. To be clear, I did not remove any of your test cases. I just cleaned it up to be implemented much more simply. It looked like you had some confusion about the semantics of inner classes, etc -- eg using static variables where unnecessary, etc (iirc you are new to Java, so perfectly understandable!). All of the same corner cases you tested are still tested, just with fewer lines of code and fitting our normal coding conventions. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8060) Add a capability to use of consistent checksums for append and copy
[ https://issues.apache.org/jira/browse/HADOOP-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235984#comment-13235984 ] Todd Lipcon commented on HADOOP-8060: - Hi Kihwal. What about making the checksum type part of the FileSystem cache key (like we do for UGI?) It seems like we would have similar problems with configurable timeouts, etc. Add a capability to use of consistent checksums for append and copy --- Key: HADOOP-8060 URL: https://issues.apache.org/jira/browse/HADOOP-8060 Project: Hadoop Common Issue Type: Bug Components: fs, util Affects Versions: 0.23.0, 0.24.0, 0.23.1 Reporter: Kihwal Lee Assignee: Kihwal Lee Fix For: 0.24.0, 0.23.2 After the improved CRC32C checksum feature became default, some of use cases involving data movement are no longer supported. For example, when running DistCp to copy from a file stored with the CRC32 checksum to a new cluster with the CRC32C set to default checksum, the final data integrity check fails because of mismatch in checksums. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
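A hypothetical illustration of that cache-key idea; the field names are invented and this is not the actual FileSystem.Cache.Key implementation:
{code}
// Invented names; illustrates folding the checksum type into the cache key's
// equals/hashCode so FileSystem instances configured with different checksum
// types don't collide in the cache. Fields are assumed non-null for brevity.
final class FsCacheKeySketch {
  final String scheme;
  final String authority;
  final String ugi;
  final String checksumType; // e.g. "CRC32" vs "CRC32C"

  FsCacheKeySketch(String scheme, String authority, String ugi, String checksumType) {
    this.scheme = scheme;
    this.authority = authority;
    this.ugi = ugi;
    this.checksumType = checksumType;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof FsCacheKeySketch)) {
      return false;
    }
    FsCacheKeySketch k = (FsCacheKeySketch) o;
    return scheme.equals(k.scheme) && authority.equals(k.authority)
        && ugi.equals(k.ugi) && checksumType.equals(k.checksumType);
  }

  @Override
  public int hashCode() {
    return ((scheme.hashCode() * 31 + authority.hashCode()) * 31
        + ugi.hashCode()) * 31 + checksumType.hashCode();
  }
}
{code}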
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236116#comment-13236116 ] Todd Lipcon commented on HADOOP-8163: - bq. Am I missing something, or are ensureBaseZNode and baseNodeExists only called by the tests? If so, we should probably relocate them, or at least mark them @VisibleForTesting if they can't be moved for some reason. These are used by my forthcoming patch for the ZK-based automatic failover controller. The ZKFC has a -formatZK flag which calls through to ensureBaseZNode. Once this gets committed I'll move forward uploading the patch there. I fixed the other three of ATM's comments. I'll wait til tomorrow to commit this in case Bikas has any additional feedback. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236213#comment-13236213 ] Todd Lipcon commented on HADOOP-8163: - bq. zkMostRecentFilePath is open to being misunderstood. Same for MOST_RECENT_FILENAME. Done bq. actually MostRecent seems to be a misnomer to me. I think it actually is LockOwnerInfo/LeaderInfo. zkLockOwnerInfoPath/tryDeleteLeaderInfo etc. It's not always the lock owner, though. Basically, we go through the following states: ||Time step||Lock node||MostRecentActive||Description|| |1|-|-|Startup| |2|Node A|-|Node A acquires active lock| |3|Node A|Node A|..and writes its own info| |4|-|Node A|A loses its ZK lease| |5|Node B|Node A|Node B acquires active lock| |6|Node B|-|Node B fences node A| |7|Node B|Node B|Node B writes its info| So, in steps 3 and 7, calling it LeaderInfo or LockOwnerInfo makes sense. But in steps 4 and 5, it's the PreviousLeaderInfo. Perhaps just renaming to LeaderBreadcrumb or something makes more sense, since it's basically a bread crumb left around by the previous leader so that future leaders know its info. bq. why is ensureBaseNode() needed? In it we are creating a new set of znodes with the given zkAcls which may or may not be the correct thing. eg. if the admin simply forgot to create the appropriate znode path before starting the service it might be ok to fail. Instead of trying to create the path ourselves with permissions that may or may not be appropriate for the entire path. I would be wary of doing this. What is the use case? The use case is a ZKFailoverController -formatZK command line tool that I'm building into the ZKFC code. The thinking is that most administrators won't want to go into the ZK CLI to manually create the parent znode while installing HDFS. Instead, they'd rather just issue this simple command. In the case that they want to have varying permissions across the path, or some more complicated ACL, then they'll have to use the ZK CLI, but for the common case I think this will make deployment much simpler. bq. consider renaming baseNodeExists() to parentNodeExists() or renaming the parentZnodeName parameter in the constructor to baseNode for consistency. Perhaps this could be called in the constructor to check that the znode exists and be done with config issues. No need for ensureBaseNode() above. Renamed to parentZNodeExists and ensureParentZNode bq. this must be my newbie java skills but I find something like - prefixPath.append("/").append(pathParts[index]) or znodeWorkingDir.subString(0, znodeWorkingDir.nextIndexOf('/')) - more readable than prefixPath = Joiner.on("/").join(Arrays.asList(pathParts).subList(0, i)). It might also be more efficient but thats not relevant for this situation. Agreed, fixed. bq. public synchronized void quitElection(boolean needFence) - Dont we want to delete the permanent znode for standby's too? Why check if state is active. It anyways calls a tryDelete* method that should be harmless. If the node is standby, then the permanent znode refers to the current lockholder. So deleting it would incorrectly signify that whoever is active doesn't need to be fenced if it crashes. bq. tryDeleteMostRecentNode() - From my understanding of tryFunction - this function should be not really be asserting that some state holds. If it should assert then we should remove try from the name. The difference here is this: the assert() guards against programmer error.
It is a mistake to call this function when you aren't active (see above comment). But if there is a ZK error (like the session got lost) it's OK to fail to delete it, since it just means that the node will get fenced. bq. in zkDoWithRetries there is a NUM_RETRIES field that could be used instead of 3. Fixed bq. why are we exposing public synchronized ZooKeeper getZKClient()? Removed bq. the following code seems to have issues... snip... While that is happening, the state of the world changes and this elector is not longer the lock owner. When appClient.fenceOldActive(data) will complete then the code will go ahead and delete the lockOwnerZnode at zkMostRecentFilePath. This node could be from the new leader who had successfully fenced and become active. The version number parameter might accidentally save us but would likely be 0 all the time. This scenario is impossible for the following reason: If the state of the world changed and this node was no longer active, the only possible reason for that is that the node lost its ZK session lease. If that's the case, then it won't be able to issue any further commands from that client (see my conversation with Hari above) bq. what happens if the leader lost the lock, tried to delete its znode, failed to do so, exited anyways, then became the next owner and found the existing mostrecent znode. I think it will try to fence itself
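As a side note on the zkDoWithRetries point above, a rough sketch of that retry pattern; the Action interface and the set of retryable codes are assumptions, not the actual elector code:
{code}
import org.apache.zookeeper.KeeperException;

// Sketch only: retry a ZK operation a bounded number of times on transient errors.
public class ZkRetrySketch {
  private static final int NUM_RETRIES = 3;

  interface Action<T> {
    T run() throws KeeperException, InterruptedException;
  }

  static <T> T zkDoWithRetries(Action<T> action)
      throws KeeperException, InterruptedException {
    int attempt = 0;
    while (true) {
      try {
        return action.run();
      } catch (KeeperException ke) {
        // Retry only transient connection-level failures, up to NUM_RETRIES attempts.
        if (isTransient(ke.code()) && ++attempt < NUM_RETRIES) {
          continue;
        }
        throw ke;
      }
    }
  }

  private static boolean isTransient(KeeperException.Code code) {
    return code == KeeperException.Code.CONNECTIONLOSS
        || code == KeeperException.Code.OPERATIONTIMEOUT;
  }
}
{code}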
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236276#comment-13236276 ] Todd Lipcon commented on HADOOP-8163: - bq. So paranoid admin deletes the lock hoping a new master might solve this If the admin is mucking about in ZK, then all bets are off. The proper thing for the admin to do is to kill B's failover controller, not to go delete a znode. bq. Yes. I am suggesting to do this within the Elector and not at the ZKFailoverController level. The self compare approach would be reasonable as long as we can assure ourselves that appData will not be same across different candidates K, that's the approach in the latest patch I uploaded. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8157) TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE
[ https://issues.apache.org/jira/browse/HADOOP-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235019#comment-13235019 ] Todd Lipcon commented on HADOOP-8157: - I think I understand this bug. It's probably due to an error in HADOOP-6502. Patch and explanation en route. TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE - Key: HADOOP-8157 URL: https://issues.apache.org/jira/browse/HADOOP-8157 Project: Hadoop Common Issue Type: Test Affects Versions: 0.24.0 Reporter: Eli Collins Assignee: Todd Lipcon Saw TestRPCCallBenchmark#testBenchmarkWithWritable fail with the following on jenkins: Caused by: java.lang.RuntimeException: IPC server unable to read call parameters: readObject can't find class java.lang.String -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233638#comment-13233638 ] Todd Lipcon commented on HADOOP-8163: - Hi Hari. I like your ideas about using this info znode for failover/restart preferences. But I don't think it's a requirement for a first draft, and it's not clear what you mean by 'state equalization' in your second point. We don't currently use this terminology. Are you OK with the current design for a first draft? We can add improvements later -- I'm using a protobuf for the info in ZK so we can evolve the information contained within without breaking compatibility. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8183) Stop using mapred.used.genericoptionsparser to avoid unnecessary warnings
[ https://issues.apache.org/jira/browse/HADOOP-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232777#comment-13232777 ] Todd Lipcon commented on HADOOP-8183: - +1 Stop using mapred.used.genericoptionsparser to avoid unnecessary warnings --- Key: HADOOP-8183 URL: https://issues.apache.org/jira/browse/HADOOP-8183 Project: Hadoop Common Issue Type: Improvement Components: util Affects Versions: 0.23.0 Reporter: Harsh J Assignee: Harsh J Priority: Minor Attachments: HADOOP-8183.patch Its about time we stopped the following from appearing in 0.23/trunk: {code} 12/03/19 20:53:51 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8151) Error handling in snappy decompressor throws invalid exceptions
[ https://issues.apache.org/jira/browse/HADOOP-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232225#comment-13232225 ] Todd Lipcon commented on HADOOP-8151: - +1, patch looks good to me. Please upload a trunk patch as well. Error handling in snappy decompressor throws invalid exceptions --- Key: HADOOP-8151 URL: https://issues.apache.org/jira/browse/HADOOP-8151 Project: Hadoop Common Issue Type: Bug Components: io, native Affects Versions: 0.24.0, 1.0.2 Reporter: Todd Lipcon Assignee: Matt Foley Attachments: HADOOP-8151-branch-1.0.patch SnappyDecompressor.c has the following code in a few places: {code} THROW(env, "Ljava/lang/InternalError", "Could not decompress data. Buffer length is too small."); {code} This is incorrect, though, since the THROW macro doesn't need the L before the class name. This results in a ClassNotFoundException for Ljava.lang.InternalError being thrown, instead of the intended exception. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13229589#comment-13229589 ] Todd Lipcon commented on HADOOP-8163: - bq. The question I had was how is the info znode creation prevented when the client does not have the ephemeral lock znode? Is this ensured in the zk client or at the zookeeper? This is ensured by ZooKeeper. The only reason the ephemeral node would disappear is if the session was expired. This means the leader has marked the session as such -- and thus, you can no longer issue commands under that same session. To be sure, I just double checked with Pat Hunt from the ZK team. Apparently there was a rare race condition bug ZOOKEEPER-1208 fixed in 3.3.4/3.4.0 about this exact case: https://issues.apache.org/jira/browse/ZOOKEEPER-1208?focusedCommentId=13149787page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13149787 ... but since Hadoop will probably need the krb5 auth from ZK 3.4, it seems a reasonable requirement to need at least that version. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228043#comment-13228043 ] Todd Lipcon commented on HADOOP-8163: - The design here is pretty simple: *In ZK*: - add an additional znode (the info znode) next to the lock znode, which is a PERSISTENT node with the same data. *Upon successfully acquiring the lock znode:* - check if there exists an info znode -- if so, the previous active did not exit cleanly. Call an application-provided fencing hook, providing the data from the info znode -- If the fencing hook succeeds, delete the info znode - create an info znode with one's own app data - proceed to call the {{becomeActive}} API on the app *Upon crashing:* - the ephemeral node disappears - by the order of events above, if the application has become active, then it will have created an info znode so whoever recovers knows to fence it *Upon graceful exit:* - first transition out of active mode (e.g. shutdown the NN) - then delete the info node - then close the session (deleting the ephemeral node) Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
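A minimal sketch of the acquire-then-fence sequence above, assuming a hypothetical FencingHook interface and a fixed OPEN_ACL_UNSAFE ACL; the real elector uses configurable ACLs and application-provided data:
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: illustrates the order of operations, not the real ActiveStandbyElector.
public class InfoNodeFencer {
  interface FencingHook {
    void fenceOldActive(byte[] oldActiveData) throws Exception;
  }

  private final ZooKeeper zk;
  private final String infoNodePath;
  private final FencingHook hook;

  InfoNodeFencer(ZooKeeper zk, String infoNodePath, FencingHook hook) {
    this.zk = zk;
    this.infoNodePath = infoNodePath;
    this.hook = hook;
  }

  // Called after this node has successfully acquired the ephemeral lock znode.
  void onLockAcquired(byte[] myData) throws Exception {
    try {
      byte[] oldData = zk.getData(infoNodePath, false, null);
      // Info znode still present: the previous active did not exit cleanly, so fence it.
      hook.fenceOldActive(oldData);
      zk.delete(infoNodePath, -1);
    } catch (KeeperException.NoNodeException e) {
      // No info znode: the previous active exited cleanly, nothing to fence.
    }
    // Leave our own breadcrumb before telling the app to become active.
    zk.create(infoNodePath, myData, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }
}
{code}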
[jira] [Commented] (HADOOP-7788) HA: Simple HealthMonitor class to watch an HAService
[ https://issues.apache.org/jira/browse/HADOOP-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228100#comment-13228100 ] Todd Lipcon commented on HADOOP-7788: - Oh, sorry, I also left in the main() method. Though the test covers the code fairly well, having a main() method is helpful for manual testing of some things like kill -STOPping the monitored process and making sure timeouts are handled correctly, etc. That's hard to mock out. HA: Simple HealthMonitor class to watch an HAService Key: HADOOP-7788 URL: https://issues.apache.org/jira/browse/HADOOP-7788 Project: Hadoop Common Issue Type: New Feature Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-7788.txt, hdfs-2524.txt This is a utility class which will be part of the FailoverController. The class starts a daemon thread which periodically monitors an HAService, calling its monitorHealth function. It then generates callbacks into another class when the health status changes (eg the RPC fails or the service returns a HealthCheckFailedException) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8154) DNS#getIPs shouldn't silently return the local host IP for bogus interface names
[ https://issues.apache.org/jira/browse/HADOOP-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225587#comment-13225587 ] Todd Lipcon commented on HADOOP-8154: - Under what circumstances would the following code trigger? {code}
+    } catch (SocketException e) {
+      LOG.warn("I/O error finding interface " + strInterface +
+          ": " + e.getMessage());
{code} Seems strange that we fall back to the default there, but throw an exception if we specify an invalid one. DNS#getIPs shouldn't silently return the local host IP for bogus interface names Key: HADOOP-8154 URL: https://issues.apache.org/jira/browse/HADOOP-8154 Project: Hadoop Common Issue Type: Bug Components: conf Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8154.txt DNS#getIPs silently returns the local host IP for bogus interface names. In this case let's throw an UnknownHostException. This is technically an incompatible change. I suspect the current behavior was originally introduced so the interface name default works w/o explicitly checking for it. It may also be used in cases where someone is using a shared config file and an option like dfs.datanode.dns.interface or hbase.master.dns.interface and e.g. interface eth3 that some hosts don't have, though I think silently ignoring this is the wrong behavior (those hosts should be configured to use a different interface). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
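A minimal sketch of the proposed behavior (fail loudly instead of silently falling back to the local host IP), using only the JDK NetworkInterface API; the class and method shape here are illustrative and not the actual patch.
{code}
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only, not the HADOOP-8154 patch.
class InterfaceLookup {
  // Return the addresses of the named interface, or fail loudly if it doesn't exist.
  static List<InetAddress> getIPs(String strInterface)
      throws UnknownHostException, SocketException {
    if ("default".equals(strInterface)) {
      // The "default" name still needs an explicit check once the fallback is gone.
      return Collections.singletonList(InetAddress.getLocalHost());
    }
    NetworkInterface netIf = NetworkInterface.getByName(strInterface);
    if (netIf == null) {
      // Previously this fell back to the local host IP; throw instead.
      throw new UnknownHostException("No such interface " + strInterface);
    }
    return Collections.list(netIf.getInetAddresses());
  }
}
{code}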
[jira] [Commented] (HADOOP-7806) [DNS] Support binding to sub-interfaces
[ https://issues.apache.org/jira/browse/HADOOP-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225768#comment-13225768 ] Todd Lipcon commented on HADOOP-7806: - +1 pending results on new patch [DNS] Support binding to sub-interfaces --- Key: HADOOP-7806 URL: https://issues.apache.org/jira/browse/HADOOP-7806 Project: Hadoop Common Issue Type: New Feature Components: util Affects Versions: 0.24.0 Reporter: Harsh J Assignee: Harsh J Fix For: 0.24.0 Attachments: HADOOP-7806.patch, HADOOP-7806.patch, hadoop-7806.txt Right now, with the {{DNS}} class, we can look up IPs of provided interface names ({{eth0}}, {{vm1}}, etc.). However, it would be useful if the I/F -> IP lookup also took a look at subinterfaces ({{eth0:1}}, etc.) and allowed binding to only a specified subinterface / virtual interface. This should be fairly easy to add, by matching against all available interfaces' subinterfaces via Java. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
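As a rough illustration of the matching described above (not the actual HADOOP-7806 patch), the JDK already exposes sub-interfaces via NetworkInterface#getSubInterfaces():
{code}
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

// Illustrative sketch: find an interface or sub-interface (e.g. "eth0:1") by name.
class SubInterfaceLookup {
  static NetworkInterface find(String name) throws SocketException {
    for (NetworkInterface nif : Collections.list(NetworkInterface.getNetworkInterfaces())) {
      if (nif.getName().equals(name)) {
        return nif;
      }
      // Sub-interfaces are not returned by getNetworkInterfaces(), so check them explicitly.
      for (NetworkInterface sub : Collections.list(nif.getSubInterfaces())) {
        if (sub.getName().equals(name)) {
          return sub;
        }
      }
    }
    return null;  // caller decides whether to throw for an unknown name
  }
}
{code}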
[jira] [Commented] (HADOOP-8157) TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE
[ https://issues.apache.org/jira/browse/HADOOP-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225873#comment-13225873 ] Todd Lipcon commented on HADOOP-8157: - This failure is super-goofy. My hunch is it's something to do with non-threadsafe use of classloaders or some other bad synchronization, but I don't have much to go on. Any ideas? TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE - Key: HADOOP-8157 URL: https://issues.apache.org/jira/browse/HADOOP-8157 Project: Hadoop Common Issue Type: Test Affects Versions: 0.24.0 Reporter: Eli Collins Saw TestRPCCallBenchmark#testBenchmarkWithWritable fail with the following on jenkins: Caused by: java.lang.RuntimeException: IPC server unable to read call parameters: readObject can't find class java.lang.String -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8151) Error handling in snappy decompressor throws invalid exceptions
[ https://issues.apache.org/jira/browse/HADOOP-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13224848#comment-13224848 ] Todd Lipcon commented on HADOOP-8151: - This bug seems to occur in lz4 as well. It also seems like the wrong kind of exception to throw - InternalError is for JVM-internal unexpected conditions. Error handling in snappy decompressor throws invalid exceptions --- Key: HADOOP-8151 URL: https://issues.apache.org/jira/browse/HADOOP-8151 Project: Hadoop Common Issue Type: Bug Components: io, native Affects Versions: 0.24.0, 1.0.2 Reporter: Todd Lipcon SnappyDecompressor.c has the following code in a few places: {code}
THROW(env, "Ljava/lang/InternalError", "Could not decompress data. Buffer length is too small.");
{code} This is incorrect, though, since the THROW macro doesn't need the L before the class name. This results in a ClassNotFoundException for Ljava.lang.InternalError being thrown, instead of the intended exception. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8153) Fail to submit mapred job on a secured-HA-HDFS: logical URI cannot be picked up by job submission.
[ https://issues.apache.org/jira/browse/HADOOP-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13224908#comment-13224908 ] Todd Lipcon commented on HADOOP-8153: - Looks like we need to override FileSystem.getCanonicalServiceName in DistributedFileSystem so that the canonical name is just the logical name, for the case of HA HDFS file systems. Fail to submit mapred job on a secured-HA-HDFS: logical URI cannot be picked up by job submission. Key: HADOOP-8153 URL: https://issues.apache.org/jira/browse/HADOOP-8153 Project: Hadoop Common Issue Type: Bug Components: ha, security Affects Versions: 0.24.0 Reporter: Mingjie Lai Fix For: 0.24.0 When testing the combination of NN HA + security + yarn, I found that the mapred job submission cannot pick up the logical URI of a nameservice. I have the logical URI configured in core-site.xml: {code}
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ns1</value>
</property>
{code} The HDFS client can work with the HA deployment/configs: {code}
[root@nn1 hadoop]# hdfs dfs -ls /
Found 6 items
drwxr-xr-x - hbase hadoop 0 2012-03-07 20:42 /hbase
drwxrwxrwx - yarn hadoop 0 2012-03-07 20:42 /logs
drwxr-xr-x - mapred hadoop 0 2012-03-07 20:42 /mapred
drwxr-xr-x - mapred hadoop 0 2012-03-07 20:42 /mr-history
drwxrwxrwt - hdfs hadoop 0 2012-03-07 21:57 /tmp
drwxr-xr-x - hdfs hadoop 0 2012-03-07 20:42 /user
{code} but cannot submit a mapred job with security turned on: {code}
[root@nn1 hadoop]# /usr/lib/hadoop/bin/yarn --config ./conf jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar randomwriter out
Running 0 maps.
Job started: Wed Mar 07 23:28:23 UTC 2012
java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:431)
    at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:312)
    at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:217)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:119)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:411)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:326)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1221)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
{code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
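A sketch of what Todd's suggestion might look like; the HAUtil.isLogicalUri() helper is assumed here for illustration, and this is not necessarily the fix that was committed.
{code}
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HAUtil;

// Hypothetical sketch, not the committed patch: for a logical HA URI such as
// hdfs://ns1, return the logical name itself instead of a resolved host:port,
// so TokenCache can look up the delegation token by "ns1".
public class LogicalNameDistributedFileSystem extends DistributedFileSystem {
  @Override
  public String getCanonicalServiceName() {
    if (HAUtil.isLogicalUri(getConf(), getUri())) {  // assumed helper
      return getUri().getHost();                     // e.g. "ns1"
    }
    return super.getCanonicalServiceName();          // normal host:port service name
  }
}
{code}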
[jira] [Commented] (HADOOP-8135) Add ByteBufferReadable interface to FSDataInputStream
[ https://issues.apache.org/jira/browse/HADOOP-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221334#comment-13221334 ] Todd Lipcon commented on HADOOP-8135: - {code}
+ * @return - the number of bytes available to read from buf
{code} style nit: no '-' here. Also, it's probably worth noting in the javadoc that many FS implementations may throw UnsupportedOperationException. Add ByteBufferReadable interface to FSDataInputStream - Key: HADOOP-8135 URL: https://issues.apache.org/jira/browse/HADOOP-8135 Project: Hadoop Common Issue Type: New Feature Components: fs Reporter: Henry Robinson Assignee: Henry Robinson Attachments: HADOOP-8135.patch To prepare for HDFS-2834, it's useful to add an interface to FSDataInputStream (and others inside hdfs) that adds a read(ByteBuffer...) method as follows: {code}
/**
 * Reads up to buf.remaining() bytes into buf. Callers should use
 * buf.limit(..) to control the size of the desired read.
 *
 * After the call, buf.position() should be unchanged, and therefore any data
 * can be immediately read from buf.
 *
 * @param buf
 * @return - the number of bytes available to read from buf
 * @throws IOException
 */
public int read(ByteBuffer buf) throws IOException;
{code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
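From the caller's side, usage of the proposed method would look something like the sketch below, assuming a stream that implements the new interface; the helper class is illustrative only.
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative caller-side use of the proposed read(ByteBuffer) API.
class ByteBufferReadExample {
  static ByteBuffer readChunk(FSDataInputStream in, int chunkSize) throws IOException {
    ByteBuffer buf = ByteBuffer.allocateDirect(chunkSize);
    buf.limit(chunkSize);        // per the javadoc, limit() controls the desired read size
    int n = in.read(buf);        // may throw UnsupportedOperationException on some FSes
    if (n < 0) {
      throw new IOException("Unexpected end of stream");
    }
    // Per the proposed contract, buf.position() is unchanged, so the n bytes
    // can be consumed from buf immediately without flipping.
    return buf;
  }
}
{code}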
[jira] [Commented] (HADOOP-8104) Inconsistent Jackson versions
[ https://issues.apache.org/jira/browse/HADOOP-8104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214192#comment-13214192 ] Todd Lipcon commented on HADOOP-8104: - Will this now break HBase or other projects which also use Jersey? HBase appears to use jersey 1.4. Inconsistent Jackson versions - Key: HADOOP-8104 URL: https://issues.apache.org/jira/browse/HADOOP-8104 Project: Hadoop Common Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HADOOP-8104.patch This is a Maven build issue. Jersey 1.8 is pulling in version 1.7.1 of Jackson. Meanwhile, we are manually specifying that we want version 1.8 of Jackson in the POM files. This causes a conflict where Jackson produces unexpected results when serializing Map objects. How to reproduce: try this code: {quote}
ObjectMapper mapper = new ObjectMapper();
Map<String, Object> m = new HashMap<String, Object>();
mapper.writeValue(new File("foo"), m);
{quote} You will get an exception: {quote}
Exception in thread "main" java.lang.NoSuchMethodError: org.codehaus.jackson.type.JavaType.isMapLikeType()Z
    at org.codehaus.jackson.map.ser.BasicSerializerFactory.buildContainerSerializer(BasicSerializerFactory.java:396)
    at org.codehaus.jackson.map.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:267)
{quote} Basically the inconsistent versions of various Jackson components are causing this NoSuchMethodError. As far as I know, this only occurs when serializing maps -- that's why it hasn't been found and fixed yet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8097) TestRPCCallBenchmark failing w/ port in use -handling badly
[ https://issues.apache.org/jira/browse/HADOOP-8097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212751#comment-13212751 ] Todd Lipcon commented on HADOOP-8097: - I'm not sure this is the best fix (relying on a different static port). A few other ideas: - change the benchmark so that if a port isn't specified, it binds to port 0, and then has the clients connect to whichever port gets bound - make sure it uses REUSEADDR so that it can still bind despite the TIME_WAIT sockets Either of those make sense? I honestly thought I'd written it to use port 0 but apparently I didn't :) TestRPCCallBenchmark failing w/ port in use -handling badly --- Key: HADOOP-8097 URL: https://issues.apache.org/jira/browse/HADOOP-8097 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Fix For: 0.24.0 Attachments: HADOOP-8097.patch I'm seeing TestRPCCallBenchmark fail with port in use, which is probably related to some other test (race condition on shutdown?), but which isn't being handled that well in the test itself -although the log shows the binding exception, the test is failing on a connection timeout -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
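The first suggestion is straightforward with plain java.net sockets; here is a minimal sketch of the idea (not the benchmark's actual server setup):
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Illustrative only: bind to an ephemeral port and report the port actually chosen.
class EphemeralPortExample {
  public static void main(String[] args) throws IOException {
    ServerSocket server = new ServerSocket();
    server.setReuseAddress(true);                 // tolerate lingering TIME_WAIT sockets
    server.bind(new InetSocketAddress(0));        // port 0 = let the OS pick a free port
    int actualPort = server.getLocalPort();       // clients connect to whatever got bound
    System.out.println("Benchmark server listening on port " + actualPort);
    server.close();
  }
}
{code}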
[jira] [Commented] (HADOOP-8093) HadoopRpcRequestProto should not be serialized twice
[ https://issues.apache.org/jira/browse/HADOOP-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212307#comment-13212307 ] Todd Lipcon commented on HADOOP-8093: - This seems like a dup of HADOOP-8084, but the implementation in 8084 actually avoids one more copy than this. HadoopRpcRequestProto should not be serialized twice --- Key: HADOOP-8093 URL: https://issues.apache.org/jira/browse/HADOOP-8093 Project: Hadoop Common Issue Type: Improvement Components: ipc Affects Versions: 0.24.0, 0.23.2 Environment: Windows 7 Reporter: Changming Sun Attachments: HADOOP-8093.patch Original Estimate: 1m Remaining Estimate: 1m {code}
@Override
public void write(DataOutput out) throws IOException {
  out.writeInt(message.toByteArray().length);
  out.write(message.toByteArray());
}
{code} This code is inefficient: it serializes the message to a byte array twice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
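To make the inefficiency concrete, a sketch of the minimal fix is to serialize the message once and reuse the byte array; per the comment above, HADOOP-8084 avoids one more copy still. The surrounding class here is invented for illustration.
{code}
import java.io.DataOutput;
import java.io.IOException;
import com.google.protobuf.Message;

// Sketch only: call toByteArray() a single time instead of twice.
class SingleSerializationWriter {
  private final Message message;

  SingleSerializationWriter(Message message) {
    this.message = message;
  }

  public void write(DataOutput out) throws IOException {
    byte[] bytes = message.toByteArray();  // one serialization of the protobuf
    out.writeInt(bytes.length);
    out.write(bytes);
  }
}
{code}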
[jira] [Commented] (HADOOP-8066) The full docs build intermittently fails
[ https://issues.apache.org/jira/browse/HADOOP-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208684#comment-13208684 ] Todd Lipcon commented on HADOOP-8066: - This is a regression, right? Any chance we could revert the commit that introduced it while we figure out the solution? Or introduce a workaround even if it's temporary and slows the build? It's bad to not get the nightly test results anymore. The full docs build intermittently fails Key: HADOOP-8066 URL: https://issues.apache.org/jira/browse/HADOOP-8066 Project: Hadoop Common Issue Type: Bug Components: build Affects Versions: 0.24.0 Reporter: Aaron T. Myers Assignee: Andrew Bayer See for example: https://builds.apache.org/job/Hadoop-Hdfs-trunk/954/ https://builds.apache.org/job/Hadoop-Common-trunk/317/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8069) Enable TCP_NODELAY by default for IPC
[ https://issues.apache.org/jira/browse/HADOOP-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209012#comment-13209012 ] Todd Lipcon commented on HADOOP-8069: - Hi Daryn. Your above descriptions sound right, except the nagle delay on Linux is 40ms rather than 200 (I think the dack delay is 200 though like you said). I hacked up something like my #4 yesterday morning but didn't really like the way I did it so I threw it away. I'll try again soon :) Enable TCP_NODELAY by default for IPC - Key: HADOOP-8069 URL: https://issues.apache.org/jira/browse/HADOOP-8069 Project: Hadoop Common Issue Type: Improvement Components: ipc Affects Versions: 0.23.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8069.txt I think we should switch the default for the IPC client and server NODELAY options to true. As wikipedia says: {quote} In general, since Nagle's algorithm is only a defense against careless applications, it will not benefit a carefully written application that takes proper care of buffering; the algorithm has either no effect, or negative effect on the application. {quote} Since our IPC layer is well contained and does its own buffering, we shouldn't be careless. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
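For context, TCP_NODELAY is the standard per-socket switch for Nagle's algorithm; enabling it in Java is a one-liner, and the proposal is simply to make that the default for the IPC client and server sockets. The helper below is illustrative, not the actual IPC code.
{code}
import java.io.IOException;
import java.net.Socket;

// Illustrative only: what enabling TCP_NODELAY looks like at the socket level.
class NoDelayExample {
  static Socket connect(String host, int port) throws IOException {
    Socket s = new Socket(host, port);
    // Disable Nagle's algorithm: small writes are sent immediately instead of
    // being delayed (~40ms on Linux) while waiting to coalesce with later writes.
    s.setTcpNoDelay(true);
    return s;
  }
}
{code}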
[jira] [Commented] (HADOOP-8069) Enable TCP_NODELAY by default for IPC
[ https://issues.apache.org/jira/browse/HADOOP-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209150#comment-13209150 ] Todd Lipcon commented on HADOOP-8069: - My hunch is that it's pretty small. I think the only RPC to the NN which would be at all frequent and cross the 8K boundary would be getListing(). On one production hbase cluster I collected metrics from a while back, getListing represented 8.3% of the RPCs. On one of our QA clusters that's been running MR workloads, it represents 2.3%. Unfortunately we don't have enough metrics to get any info on the size distribution of those responses. Would be interested to hear if some of your production clusters show a similar mix. Enable TCP_NODELAY by default for IPC - Key: HADOOP-8069 URL: https://issues.apache.org/jira/browse/HADOOP-8069 Project: Hadoop Common Issue Type: Improvement Components: ipc Affects Versions: 0.23.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8069.txt I think we should switch the default for the IPC client and server NODELAY options to true. As wikipedia says: {quote} In general, since Nagle's algorithm is only a defense against careless applications, it will not benefit a carefully written application that takes proper care of buffering; the algorithm has either no effect, or negative effect on the application. {quote} Since our IPC layer is well contained and does its own buffering, we shouldn't be careless. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira