[jira] [Commented] (HADOOP-8148) Zero-copy ByteBuffer-based compressor / decompressor API
[ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257987#comment-13257987 ] Todd Lipcon commented on HADOOP-8148: - Duplicating my comment from HADOOP-8258: {quote} In current versions of Hadoop, the read path for applications like HBase often looks like: - allocate a byte array for an HFile block (~64kb) - call read() into that byte array: -- copy 1: read() packets from the socket into a direct buffer provided by the DirectBufferPool -- copy 2: copy from the direct buffer pool into the provided byte[] - call setInput on a decompressor -- copy 3: copy from the byte[] back to a direct buffer inside the codec implementation - call decompress: -- JNI code accesses the input buffer and writes to the output buffer -- copy 4: from the output buffer back into the byte[] for the uncompressed hfile block -- inefficiency: HBase now does its own checksumming. Since it has to checksum the byte[], it can't easily use the SSE-enabled checksum path. Given the new direct-buffer read support introduced by HDFS-2834, we can remove copy #2 and #3 - allocate a DirectBuffer for the compressed hfile block, and one for the uncompressed block (we know the size from the hfile block header) - call read() into the direct buffer using the HDFS-2834 API -- copy 1: read() packets from the socket into that buffer - call setInput() with that buffer. no copies necessary - call decompress: -- JNI code accesses the input buffer and writes directly to the output buffer, with no copies - HBase now has the uncompressed block as a direct buffer. It can use the SSE-enabled checksum for better efficiency. This should improve the performance of HBase significantly. We may also be able to use the new API from within SequenceFile and other compressible file formats to avoid two copies from the read path. Similar applies to the write path, but in my experience the write path is less often CPU-constrained, so I'd prefer to concentrate on the read path first. {quote} Zero-copy ByteBuffer-based compressor / decompressor API Key: HADOOP-8148 URL: https://issues.apache.org/jira/browse/HADOOP-8148 Project: Hadoop Common Issue Type: New Feature Components: io Reporter: Tim Broberg Assignee: Tim Broberg Attachments: hadoop8148.patch Per Todd Lipcon's comment in HDFS-2834, "Whenever a native decompression codec is being used, ... we generally have the following copies: 1) Socket -> DirectByteBuffer (in SocketChannel implementation) 2) DirectByteBuffer -> byte[] (in SocketInputStream) 3) byte[] -> Native buffer (set up for decompression) 4*) decompression to a different native buffer (not really a copy - decompression necessarily rewrites) 5) native buffer -> byte[]" With the proposed improvement we can hopefully eliminate #2 and #3 for all applications, and #2, #3, and #5 for libhdfs. The interfaces in the attached patch attempt to address: A - Compression and decompression based on ByteBuffers (HDFS-2834) B - Zero-copy compression and decompression (HDFS-3051) C - Provide the caller a way to know the max space required to hold compressed output. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
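To make the shape of the proposal concrete, a minimal sketch of what ByteBuffer-based compressor/decompressor interfaces could look like follows; the names and signatures here are illustrative assumptions, not the API in the attached patch:
{code}
import java.io.IOException;
import java.nio.ByteBuffer;

// Illustrative sketch only -- not the committed Hadoop API.
interface ByteBufferCompressor {
  /** Supply uncompressed input; bytes are consumed from position() to limit(). */
  void setInput(ByteBuffer uncompressed);
  /** Compress into {@code compressed}, advancing both positions; returns bytes produced. */
  int compress(ByteBuffer compressed) throws IOException;
  /** Point C above: an upper bound on the space needed to hold the compressed output. */
  int maxCompressedLength(int uncompressedLength);
}

interface ByteBufferDecompressor {
  /** Supply compressed input, e.g. a direct buffer filled by an HDFS-2834 style read. */
  void setInput(ByteBuffer compressed);
  /** Decompress into {@code uncompressed} with no intermediate byte[] copies. */
  int decompress(ByteBuffer uncompressed) throws IOException;
}
{code}
With interfaces of this shape, a native codec's JNI code can address both buffers directly, which is what eliminates copies #3 and #5 above.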
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13253885#comment-13253885 ] Todd Lipcon commented on HADOOP-8247: - bq. The problem of this jira is that it makes the auto and manual failover exclusive to each other Yes, this is a temporary state along the way. As discussed elsewhere, we need to flip the manual HA commands over to communicate with the ZKFCs when automatic failover is enabled. Since that code isn't done yet, the current behavior is to disable manual failover. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Auto Failover (HDFS-3042) Attachments: hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8279) Auto-HA: Allow manual failover to be invoked from zkfc.
[ https://issues.apache.org/jira/browse/HADOOP-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13253887#comment-13253887 ] Todd Lipcon commented on HADOOP-8279: - Thanks for filing this, Mingjie. I plan to work on it in the coming weeks. Auto-HA: Allow manual failover to be invoked from zkfc. --- Key: HADOOP-8279 URL: https://issues.apache.org/jira/browse/HADOOP-8279 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Mingjie Lai Assignee: Todd Lipcon Fix For: Auto Failover (HDFS-3042) HADOOP-8247 introduces a config flag to prevent potential status inconsistency between zkfc and namenode, by making auto and manual failover mutually exclusive. However, as described in section 2.7.2 of the design doc at HDFS-2185, we should allow manual and auto failover to co-exist, by: - adding some rpc interfaces at zkfc - manual failover shall be triggered by haadmin, and handled by zkfc if auto failover is enabled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
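A rough sketch of the kind of zkfc-side RPC surface those two bullets describe; the interface and method names are placeholders, not the protocol that was eventually committed:
{code}
import java.io.IOException;

// Placeholder names only: a ZKFC-side RPC interface so that haadmin can route
// manually requested failovers through the failover controller when auto-HA is on.
interface ZKFCFailoverRpc {
  /** Ask this ZKFC to give up (or decline) the active state for a while,
      so that the other node's ZKFC can grab the lock. */
  void cedeActive(int millisToCede) throws IOException;

  /** Ask this ZKFC to coordinate a graceful failover to its local NameNode. */
  void gracefulFailover() throws IOException;
}
{code}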
[jira] [Commented] (HADOOP-8271) PowerPc Build error.
[ https://issues.apache.org/jira/browse/HADOOP-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13252603#comment-13252603 ] Todd Lipcon commented on HADOOP-8271: - Patch looks good. Can you please make a patch against trunk, as well? We'll want to check this in to all branches. PowerPc Build error. Key: HADOOP-8271 URL: https://issues.apache.org/jira/browse/HADOOP-8271 Project: Hadoop Common Issue Type: Bug Components: build Affects Versions: 1.0.2, 1.0.3 Environment: Linux RHEL 6.1 PowerPC + IBM JVM 6.0 SR10 Reporter: Kumar Ravi Labels: patch Fix For: 1.0.3 Attachments: HADOOP-8271.patch Original Estimate: 168h Remaining Estimate: 168h When attempting to build branch-1, the following error is seen and ant exits. [exec] configure: error: Unsupported CPU architecture powerpc64 The following command was used to build hadoop-common ant -Dlibhdfs=true -Dcompile.native=true -Dfusedfs=true -Dcompile.c++=true -Dforrest.home=$FORREST_HOME compile-core-native compile-c++ compile-c++-examples task-controller tar record-parser compile-hdfs-classes package -Djava5.home=/opt/ibm/ibm-java2-ppc64-50/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8198) Support multiple network interfaces
[ https://issues.apache.org/jira/browse/HADOOP-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251892#comment-13251892 ] Todd Lipcon commented on HADOOP-8198: - I agree with the above comments that tokens are starting to fall apart. But, I don't think this current proposal has any relation to the token issue -- Eli is only proposing to add multi-NIC support for datanodes, and datanodes don't have service tokens. They only validate block tokens, which have no associated host/IP/etc. If we wanted multi-NIC on the NN RPC, the token issue would be a blocker, but I don't think that's the current proposal. Support multiple network interfaces --- Key: HADOOP-8198 URL: https://issues.apache.org/jira/browse/HADOOP-8198 Project: Hadoop Common Issue Type: New Feature Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: MultipleNifsv1.pdf, MultipleNifsv2.pdf, MultipleNifsv3.pdf Hadoop does not currently utilize multiple network interfaces, which is a common user request, and important in enterprise environments. This jira covers a proposal for enhancements to Hadoop so it better utilizes multiple network interfaces. The primary motivation being improved performance, performance isolation, resource utilization and fault tolerance. The attached design doc covers the high-level use cases, requirements, a proposal for trunk/0.23, discussion on related features, and a proposal for Hadoop 1.x that covers a subset of the functionality of the trunk/0.23 proposal. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8198) Support multiple network interfaces
[ https://issues.apache.org/jira/browse/HADOOP-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251926#comment-13251926 ] Todd Lipcon commented on HADOOP-8198: - bq. Are we going to confine yarn/MR services to using only one NIC? If I recall correctly, the shuffle services use job tokens and not service tokens as well, right? I think it's OK to confine the RPC interfaces to using one NIC (for now) as they're generally not throughput-intensive. Adding multi-NIC support for them would be nice in the future for fault tolerance but I think it should be a separate task, since as you've brought up, it's much harder. Support multiple network interfaces --- Key: HADOOP-8198 URL: https://issues.apache.org/jira/browse/HADOOP-8198 Project: Hadoop Common Issue Type: New Feature Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: MultipleNifsv1.pdf, MultipleNifsv2.pdf, MultipleNifsv3.pdf Hadoop does not currently utilize multiple network interfaces, which is a common user request, and important in enterprise environments. This jira covers a proposal for enhancements to Hadoop so it better utilizes multiple network interfaces. The primary motivation being improved performance, performance isolation, resource utilization and fault tolerance. The attached design doc covers the high-level use cases, requirements, a proposal for trunk/0.23, discussion on related features, and a proposal for Hadoop 1.x that covers a subset of the functionality of the trunk/0.23 proposal. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8269) Fix some javadoc warnings on branch-1
[ https://issues.apache.org/jira/browse/HADOOP-8269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13251990#comment-13251990 ] Todd Lipcon commented on HADOOP-8269: - +1 Fix some javadoc warnings on branch-1 - Key: HADOOP-8269 URL: https://issues.apache.org/jira/browse/HADOOP-8269 Project: Hadoop Common Issue Type: Bug Components: documentation Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8269.txt There are some javadoc warnings on branch-1, let's fix them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8262) Between mapper and reducer, Hadoop inserts spaces into my string
[ https://issues.apache.org/jira/browse/HADOOP-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249942#comment-13249942 ] Todd Lipcon commented on HADOOP-8262: - http://hadoop.apache.org/mapreduce/mailing_lists.html has instructions on how to subscribe to the lists Between mapper and reducer, Hadoop inserts spaces into my string Key: HADOOP-8262 URL: https://issues.apache.org/jira/browse/HADOOP-8262 Project: Hadoop Common Issue Type: Bug Components: io Affects Versions: 0.20.0 Environment: Eclipse plugin, Windows Reporter: Adriana Sbircea In the mapper I send a number as the key, and as the value another number which has more than one digit, but I send them as Text objects. In my reducer all the values for a key have spaces between every digit of a value. I can't do my task because of this problem. I don't use combiners or anything else. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8152) Expand public APIs for security library classes
[ https://issues.apache.org/jira/browse/HADOOP-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250163#comment-13250163 ] Todd Lipcon commented on HADOOP-8152: - I generally agree that the static loginUser concept is a mess and should probably be killed in favor of using methods like {{loginFromKeytabAndReturnUGI}} everywhere. But I also agree with Aaron that we can mark these as evolving and it doesn't force our hand down the road. Expand public APIs for security library classes --- Key: HADOOP-8152 URL: https://issues.apache.org/jira/browse/HADOOP-8152 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HADOOP-8152.patch, HADOOP-8152.patch Currently projects like Hive and HBase use UserGroupInformation and SecurityUtil methods. Both of these classes are marked LimitedPrivate(HDFS,MR) but should probably be marked more generally public. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
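For reference, the keytab-login-and-return style mentioned above looks roughly like this sketch (principal and keytab path are example values; error handling trimmed):
{code}
import java.io.IOException;
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginExample {
  // Example principal/keytab values; substitute your own.
  static FileSystem fsAsServiceUser() throws IOException, InterruptedException {
    UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "hbase/host.example.com@EXAMPLE.COM", "/etc/security/keytabs/hbase.keytab");
    // Run the filesystem access as the freshly logged-in user, without
    // relying on the static loginUser state.
    return ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
      @Override
      public FileSystem run() throws Exception {
        return FileSystem.get(new Configuration());
      }
    });
  }
}
{code}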
[jira] [Commented] (HADOOP-8248) Clarify bylaws about review-then-commit policy
[ https://issues.apache.org/jira/browse/HADOOP-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250176#comment-13250176 ] Todd Lipcon commented on HADOOP-8248: - bq. For a joint work of multiple committers, all of the authors cannot review the patch for significant patches. My thinking here is that it's fine if one committer does some minor fixup or adds test cases to a patch that another authored. For example, if I start a patch, but don't get time to finish the unit tests, and you help out by adding a test, I think it's OK for you to commit it assuming I +1 your addition. Put another way, any given chunk of the patch should be reviewed by a committer who didn't write it. I don't want to get too pedantic about it, though -- IMO it's the spirit that's important. Code reviews are important for spotting mistakes, and it's hard to spot your own mistakes. So any piece of code should be +1ed by an expert (i.e. a committer) who didn't write that bit of code. bq. For merging from a branch, the three +1's cannot be cast from any of the committers who worked on the branch. I disagree on this -- my assumption is that all of the patches on the branch have been reviewed according to the above policy, so everything's been looked at by someone who didn't write it. In my mind, the +1s on the merge are basically a commitment to stand by the work to be merged and an assertion that you think it is good code, a good feature, etc. If the development on the branch looks shoddy/sketchy/whatever, then there's plenty of opportunity for other committers to -1 it. Perhaps we should add a 3-day minimum voting period for branch merges to trunk when that branch didn't follow the normal RTC guidelines? Clarify bylaws about review-then-commit policy -- Key: HADOOP-8248 URL: https://issues.apache.org/jira/browse/HADOOP-8248 Project: Hadoop Common Issue Type: Task Reporter: Todd Lipcon Attachments: c8248_20120409.patch, proposed-bylaw-change.txt As discussed on the mailing list (thread Requirements for patch review 4/4/2012) we should clarify the bylaws with respect to the review-then-commit policy. This JIRA is to agree on the proposed change. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8248) Clarify bylaws about review-then-commit policy
[ https://issues.apache.org/jira/browse/HADOOP-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250179#comment-13250179 ] Todd Lipcon commented on HADOOP-8248: - To add a little more: I think the requirement of 3 committer +1s from people who didn't work on the branch will make it really hard to ever merge branches. Looking, for example, at the recent HA branch merge, it listed the following people as patch contributors: bq. Contributed by Todd Lipcon, Aaron T. Myers, Eli Collins, Uma Maheswara Rao G, Bikas Saha, Suresh Srinivas, Jitendra Nath Pandey, Hari Mankude, Brandon Li, Sanjay Radia, Mingjie Lai, and Gregory Chanan Finding 3 active committers who are not on that list and are knowledgeable about NN internals would have been very difficult. In fact of the committers who did +1 the merge, you're the only one who isn't in the above list :) Clarify bylaws about review-then-commit policy -- Key: HADOOP-8248 URL: https://issues.apache.org/jira/browse/HADOOP-8248 Project: Hadoop Common Issue Type: Task Reporter: Todd Lipcon Attachments: c8248_20120409.patch, proposed-bylaw-change.txt As discussed on the mailing list (thread Requirements for patch review 4/4/2012) we should clarify the bylaws with respect to the review-then-commit policy. This JIRA is to agree on the proposed change. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250271#comment-13250271 ] Todd Lipcon commented on HADOOP-8247: - Hi Hari. That's specifically the point of the FORCEMANUAL flag. It is not safe to use it with automatic failover. So, the user has to accept the warning and acknowledge they're about to do something dumb, that _will_ break auto failover if the ZKFCs are running. The purpose of allowing it at all is to give a recourse for an expert admin if their ZK cluster has crashed and they need to manually do a failover in an emergency situation. Its use is highly discouraged. The warning printed is:
{code}
"--forceManual allows the manual failover commands to be used\n" +
" even when automatic failover is enabled. This\n" +
" flag is DANGEROUS and should only be used with\n" +
" expert guidance.");
{code}
Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250272#comment-13250272 ] Todd Lipcon commented on HADOOP-8247: - P.S. if you'd like I'd be happy to rename it to something even scarier sounding... like --dangerous-manual-override, or whatever you prefer. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8248) Clarify bylaws about review-then-commit policy
[ https://issues.apache.org/jira/browse/HADOOP-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250291#comment-13250291 ] Todd Lipcon commented on HADOOP-8248: - bq. Which above policy? Branches can use RTC, or whatever they decide upon. Therefore it is possible that the branch content has not actually been reviewed by another committer before merging. Right, that's why I also added: Perhaps we should add a 3-day minimum voting period for branch merges to trunk when that branch didn't follow the normal RTC guidelines? Clarify bylaws about review-then-commit policy -- Key: HADOOP-8248 URL: https://issues.apache.org/jira/browse/HADOOP-8248 Project: Hadoop Common Issue Type: Task Reporter: Todd Lipcon Attachments: c8248_20120409.patch, proposed-bylaw-change.txt As discussed on the mailing list (thread Requirements for patch review 4/4/2012) we should clarify the bylaws with respect to the review-then-commit policy. This JIRA is to agree on the proposed change. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250294#comment-13250294 ] Todd Lipcon commented on HADOOP-8247: - I also ran the manual tests again. Here's the usage output of HAAdmin:
{code}
Usage: DFSHAAdmin [-ns nameserviceId]
    [-transitionToActive [--forcemanual] serviceId]
    [-transitionToStandby [--forcemanual] serviceId]
    [-failover [--forcefence] [--forceactive] [--forcemanual] serviceId serviceId]
    [-getServiceState serviceId]
    [-checkHealth serviceId]
    [-help command]

--forceManual allows the manual failover commands to be used
 even when automatic failover is enabled. This
 flag is DANGEROUS and should only be used with
 expert guidance.
{code}
Here's what happens if I try to use a state change command with auto-HA enabled:
{code}
$ ./bin/hdfs haadmin -transitionToActive nn1
Automatic failover is enabled for NameNode at todd-w510/127.0.0.1:8021
Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state. If you are very sure you know what you are doing, please specify the forcemanual flag.
$ echo $?
255
{code}
Also checked the other two state-changing ops (transitionToStandby and failover) and they yielded the same error message.
- I verified that {{-getServiceState}} and {{-checkHealth}} continue to work.
- I verified that the -forceManual flag worked:
{code}
$ ./bin/hdfs haadmin -transitionToStandby -forcemanual nn1
12/04/09 16:12:38 WARN ha.HAAdmin: Proceeding with manual HA state management even though automatic failover is enabled for NameNode at todd-w510/127.0.0.1:8021
{code}
(also for -transitionToActive and -failover)
- Verified that {{start-dfs.sh}} starts the ZKFCs on both of my configured NNs when auto-HA is enabled. Also verified {{stop-dfs.sh}} stops the ZKFCs. Discovered trivial bug HDFS-3234 here.
Next, I modified my config to set the auto failover flag to false.
- verified that start-dfs.sh doesn't try to start ZKFCs.
- verified that if I try to start a ZKFC, it bails:
{code}
12/04/09 16:19:12 INFO tools.DFSZKFailoverController: Failover controller configured for NameNode nameserviceId1.nn2
12/04/09 16:19:12 FATAL ha.ZKFailoverController: Automatic failover is not enabled for NameNode at todd-w510/127.0.0.1:8022. Please ensure that automatic failover is enabled in the configuration before running the ZK failover controller.
{code}
- verified that the haadmin commands all function without any {{-forcemanual}} flag specified.
Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13250307#comment-13250307 ] Todd Lipcon commented on HADOOP-8247: - bq. There are always admins who disregard these warnings I think they deserve what they get... admins can also decide to run rm -Rf /my/metadata/dir and get into a bad state. bq. Instead, wouldn't it be better to come up with a set of procedures to unwedge the cluster, starting with setting auto-failover key to false, resetting NNs and using manual failover Assumedly you want to be able to do this without incurring downtime. Certainly if downtime is acceptable, that would be the right response.. But still I think having a manual override here is useful for advanced operators who need to use it in an extenuating circumstance. As I said above, I'm OK giving it a scarier name and/or making it prompt for confirmation upon use, with a scary warning message. I'm even OK removing it from the documentation, so people aren't lured into using it when they don't really know what they're doing. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt, hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8152) Expand public APIs for security library classes
[ https://issues.apache.org/jira/browse/HADOOP-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249662#comment-13249662 ] Todd Lipcon commented on HADOOP-8152: - Looking at HBase, it seems like it's also using the following which aren't marked public by this patch: - SecurityUtil.getServerPrincipal - enum UGI.AuthenticationMethod (marked evolving but not marked public) - UGI.getRealUser - UGI.isLoginKeytabBased - UGI.reloginFromKeytab - UGI.reloginFromTicketCache - UGI.getUserName - UGI.createUserForTesting Expand public APIs for security library classes --- Key: HADOOP-8152 URL: https://issues.apache.org/jira/browse/HADOOP-8152 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HADOOP-8152.patch Currently projects like Hive and HBase use UserGroupInformation and SecurityUtil methods. Both of these classes are marked LimitedPrivate(HDFS,MR) but should probably be marked more generally public. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
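For context, making these classes public here means widening the audience annotations, roughly along these lines (illustrative only, not the exact patch):
{code}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Illustrative of the annotation change being discussed, not the exact patch:
// widen the audience from LimitedPrivate({"HDFS", "MapReduce"}) to Public,
// while keeping the API marked Evolving so it can still change between minor releases.
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class UserGroupInformation {
  // existing methods (getRealUser(), reloginFromKeytab(), ...) unchanged
}
{code}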
[jira] [Commented] (HADOOP-8261) Har file system doesn't deal with FS URIs with a host but no port
[ https://issues.apache.org/jira/browse/HADOOP-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249663#comment-13249663 ] Todd Lipcon commented on HADOOP-8261: - Nit: spurious 'a' here at the end of the sentence {code} + * port specified, as is often the case with an HA setup.a {code} Another nit: I think the test case should be capitalized WithHA instead of WithHa to match our other test cases which all have the keyword HA in them (makes it easy to run mvn test '-Dtest=*HA*') +1 once you fix these Har file system doesn't deal with FS URIs with a host but no port - Key: HADOOP-8261 URL: https://issues.apache.org/jira/browse/HADOOP-8261 Project: Hadoop Common Issue Type: Bug Components: fs Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Attachments: HADOOP-8261-with-test-in-HDFS.patch, HADOOP-8261.patch If you try to run an MR job with a Hadoop Archive as the input, but the URI you give it has no port specified (e.g. hdfs://simon) the job will fail with an error like the following: {noformat} java.io.IOException: Incomplete HDFS URI, no host: hdfs://simon:-1/user/atm/input.har/input {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248746#comment-13248746 ] Todd Lipcon commented on HADOOP-8247: - I added a struct because I figured we may want to add more fields in the future that fulfill a similar purpose. For example, I can imagine that a failover event might be tagged with a string reason field -- sort of like how the Linux shutdown command can take a message. This would just be logged on the NN side. Another example is the proposed fix for HADOOP-8217, where we need to add an epoch number to the failover requests to get an ordering of failover events. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
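A sketch of the kind of request struct being described; the type and field names are illustrative guesses rather than the committed HA protocol classes:
{code}
// Illustrative only: a request-info struct that can grow new fields without
// changing the RPC method signatures.
class StateChangeRequestInfo {
  enum RequestSource { REQUEST_BY_USER, REQUEST_BY_USER_FORCED, REQUEST_BY_ZKFC }

  private final RequestSource source;

  // Possible future fields mentioned in the comment above:
  // private String reason;  // operator-supplied reason, logged on the NN side
  // private long epoch;     // ordering of failover events (HADOOP-8217)

  StateChangeRequestInfo(RequestSource source) {
    this.source = source;
  }

  RequestSource getSource() {
    return source;
  }
}
{code}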
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248918#comment-13248918 ] Todd Lipcon commented on HADOOP-8247: - Hi Hari. JIRA doesn't support cross-project subtasks. You can use the following filter to track all auto-HA related tasks: https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hiderequestId=12319482 (let me know if the link doesn't work, I think I set it up to be world-shared) Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8257) Auto-HA: TestZKFailoverControllerStress occasionally fails with Mockito error
[ https://issues.apache.org/jira/browse/HADOOP-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248928#comment-13248928 ] Todd Lipcon commented on HADOOP-8257: - Jenkins won't run on this since it's on a branch. I verified by changing the test runtime to 3 seconds and looping it. Without the patch, it failed with the mockito error after 3 or 4 minutes. I then looped with the patch for 15 minutes without a failure. Auto-HA: TestZKFailoverControllerStress occasionally fails with Mockito error - Key: HADOOP-8257 URL: https://issues.apache.org/jira/browse/HADOOP-8257 Project: Hadoop Common Issue Type: Bug Components: auto-failover, test Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Trivial Attachments: hadoop-8257.txt Once in a while I've seen the following in TestZKFailoverControllerStress: Unfinished stubbing detected here: - at org.apache.hadoop.ha.TestZKFailoverControllerStress.testRandomHealthAndDisconnects(TestZKFailoverControllerStress.java:118) E.g. thenReturn() may be missing This is because we set up the mock answers _after_ starting the ZKFCs. So if the ZKFC calls the mock object while it's in the middle of the setup, this exception occurs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
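The underlying rule the fix follows is to finish all Mockito stubbing before starting anything that can call the mock concurrently. A generic illustration of that pattern (not the actual test code), assuming a made-up HealthCheck mock:
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

// Generic illustration of the race described above, not the actual test code.
interface HealthCheck {
  boolean isHealthy();
}

class StubbingOrderExample {
  void safeSetup() {
    final HealthCheck hc = mock(HealthCheck.class);

    // Finish all stubbing *before* anything else can call the mock. If a
    // background thread hits hc.isHealthy() while a when(...) clause is still
    // half-built, Mockito reports "Unfinished stubbing detected".
    when(hc.isHealthy()).thenReturn(true);

    Thread monitor = new Thread(new Runnable() {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          hc.isHealthy(); // simulated ZKFC-style polling of the mock
        }
      }
    });
    monitor.setDaemon(true);
    monitor.start(); // only start once stubbing is complete
  }
}
{code}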
[jira] [Commented] (HADOOP-8258) Add interfaces for compression codecs to use direct byte buffers
[ https://issues.apache.org/jira/browse/HADOOP-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249035#comment-13249035 ] Todd Lipcon commented on HADOOP-8258: - In current versions of Hadoop, the read path for applications like HBase often looks like: - allocate a byte array for an HFile block (~64kb) - call read() into that byte array: -- copy 1: read() packets from the socket into a direct buffer provided by the DirectBufferPool -- copy 2: copy from the direct buffer pool into the provided byte[] - call setInput on a decompressor -- copy 3: copy from the byte[] back to a direct buffer inside the codec implementation - call decompress: -- JNI code accesses the input buffer and writes to the output buffer -- copy 4: from the output buffer back into the byte[] for the uncompressed hfile block -- inefficiency: HBase now does its own checksumming. Since it has to checksum the byte[], it can't easily use the SSE-enabled checksum path. Given the new direct-buffer read support introduced by HDFS-2834, we can remove copy #2 and #3 - allocate a DirectBuffer for the compressed hfile block, and one for the uncompressed block (we know the size from the hfile block header) - call read() into the direct buffer using the HDFS-2834 API -- copy 1: read() packets from the socket into that buffer - call setInput() with that buffer. no copies necessary - call decompress: -- JNI code accesses the input buffer and writes directly to the output buffer, with no copies - HBase now has the uncompressed block as a direct buffer. It can use the SSE-enabled checksum for better efficiency. This should improve the performance of HBase significantly. We may also be able to use the new API from within SequenceFile and other compressible file formats to avoid two copies from the read path. Similar applies to the write path, but in my experience the write path is less often CPU-constrained, so I'd prefer to concentrate on the read path first. Add interfaces for compression codecs to use direct byte buffers Key: HADOOP-8258 URL: https://issues.apache.org/jira/browse/HADOOP-8258 Project: Hadoop Common Issue Type: New Feature Components: io, native, performance Affects Versions: 3.0.0 Reporter: Todd Lipcon Currently, the codec interface only provides input/output functions based on byte arrays. Given that most of the codecs are implemented in native code, this necessitates two extra copies - one to copy the input data to a direct buffer, and one to copy the output data back to a byte array. We should add interfaces to Decompressor/Compressor that can work directly with direct byte buffers to avoid these copies. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
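Seen from the caller's side, the proposed flow would look roughly like the sketch below; the reader and decompressor types are placeholders for the APIs still being defined in the linked JIRAs:
{code}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

// The reader and decompressor types below are placeholders, not real Hadoop
// classes; only the buffer handling is the point here.
class DirectReadSketch {
  interface DirectReader {
    /** HDFS-2834-style read straight into a (direct) ByteBuffer. */
    int read(ByteBuffer dst) throws IOException;
  }

  interface DirectDecompressor {
    void setInput(ByteBuffer compressed);
    void decompress(ByteBuffer uncompressed) throws IOException;
  }

  static ByteBuffer readBlock(DirectReader in, DirectDecompressor codec,
                              int compressedLen, int uncompressedLen) throws IOException {
    // Both sizes are known from the HFile block header.
    ByteBuffer compressed = ByteBuffer.allocateDirect(compressedLen);
    ByteBuffer uncompressed = ByteBuffer.allocateDirect(uncompressedLen);

    while (compressed.hasRemaining()) {
      if (in.read(compressed) < 0) {      // copy 1: socket -> direct buffer
        throw new EOFException("unexpected end of stream");
      }
    }
    compressed.flip();

    codec.setInput(compressed);           // no intermediate byte[] copy
    codec.decompress(uncompressed);       // JNI writes straight into the output buffer
    uncompressed.flip();
    return uncompressed;                  // HBase can run the SSE checksum over this
  }
}
{code}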
[jira] [Commented] (HADOOP-8258) Add interfaces for compression codecs to use direct byte buffers
[ https://issues.apache.org/jira/browse/HADOOP-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249086#comment-13249086 ] Todd Lipcon commented on HADOOP-8258: - Ah, thanks, sorry I missed that. Do you think this JIRA should just be marked as duplicate? I can reproduce the comments into the other one. Add interfaces for compression codecs to use direct byte buffers Key: HADOOP-8258 URL: https://issues.apache.org/jira/browse/HADOOP-8258 Project: Hadoop Common Issue Type: New Feature Components: io, native, performance Affects Versions: 3.0.0 Reporter: Todd Lipcon Currently, the codec interface only provides input/output functions based on byte arrays. Given that most of the codecs are implemented in native code, this necessitates two extra copies - one to copy the input data to a direct buffer, and one to copy the output data back to a byte array. We should add interfaces to Decompressor/Compressor that can work directly with direct byte buffers to avoid these copies. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249097#comment-13249097 ] Todd Lipcon commented on HADOOP-8247: - bq. Can we make this simpler by not supporting manual failover? Yes. That's the current version of the patch - if you enable automatic, then you don't get manual. But, as described in the design doc in HDFS-2185, there are good reasons to support manually initiated failover even when the system is set up for automatic. That will be done separately as a followup. This patch is just meant for safety purposes. Another advantage of this patch is that we can amend the start-dfs.sh script to automatically start ZKFCs when the conf flag is present. My next rev will do this. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8086) KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN
[ https://issues.apache.org/jira/browse/HADOOP-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247410#comment-13247410 ] Todd Lipcon commented on HADOOP-8086: - This patch seems to use slf4j, whereas we use commons-logging elsewhere. Is this something particular to the hadoop-auth component? Or just a mistake? KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN --- Key: HADOOP-8086 URL: https://issues.apache.org/jira/browse/HADOOP-8086 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 0.23.2, 0.24.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Minor Fix For: 0.23.2 Attachments: HADOOP-8086.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8086) KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN
[ https://issues.apache.org/jira/browse/HADOOP-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247488#comment-13247488 ] Todd Lipcon commented on HADOOP-8086: - OK. I will pretend that that makes sense, and give a +1 for this patch then. KerberosName silently sets defaultRealm to if the Kerberos config is not found, it should log a WARN --- Key: HADOOP-8086 URL: https://issues.apache.org/jira/browse/HADOOP-8086 Project: Hadoop Common Issue Type: Improvement Components: security Affects Versions: 0.23.2, 0.24.0 Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Priority: Minor Fix For: 0.23.2 Attachments: HADOOP-8086.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-6941) Support non-SUN JREs in UserGroupInformation
[ https://issues.apache.org/jira/browse/HADOOP-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248011#comment-13248011 ] Todd Lipcon commented on HADOOP-6941: - Looks like this patch broke the original security support. See HADOOP-8251. Support non-SUN JREs in UserGroupInformation Key: HADOOP-6941 URL: https://issues.apache.org/jira/browse/HADOOP-6941 Project: Hadoop Common Issue Type: Bug Environment: SLES 11, Apache Harmony 6 and SLES 11, IBM Java 6 Reporter: Stephen Watt Assignee: Luke Lu Fix For: 1.0.3, 2.0.0 Attachments: 6941-1.patch, 6941-branch1.patch, HADOOP-6941.patch, hadoop-6941.patch Attempting to format the namenode or attempting to start Hadoop using Apache Harmony or the IBM Java JREs results in the following exception: 10/09/07 16:35:05 ERROR namenode.NameNode: java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal at org.apache.hadoop.security.UserGroupInformation.clinit(UserGroupInformation.java:223) at java.lang.J9VMInternals.initializeImpl(Native Method) at java.lang.J9VMInternals.initialize(J9VMInternals.java:200) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setConfigurationParameters(FSNamesystem.java:420) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:391) at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1240) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) Caused by: java.lang.ClassNotFoundException: com.sun.security.auth.UnixPrincipal at java.net.URLClassLoader.findClass(URLClassLoader.java:421) at java.lang.ClassLoader.loadClass(ClassLoader.java:652) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:346) at java.lang.ClassLoader.loadClass(ClassLoader.java:618) ... 8 more This is a negative regression as previous versions of Hadoop worked with these JREs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8251) SecurityUtil.fetchServiceTicket broken after HADOOP-6941
[ https://issues.apache.org/jira/browse/HADOOP-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248012#comment-13248012 ] Todd Lipcon commented on HADOOP-8251: - The bug was simple -- the string used for the name of the Krb5Util class was mistakenly just the package name instead of the class name. It looks like the IBM implementation has the same bug, but googling around, I don't think there even _is_ a Krb5Util class in IBM's library, at least not with the functions we need. So I am skeptical that security support works when running on the IBM JRE. SecurityUtil.fetchServiceTicket broken after HADOOP-6941 Key: HADOOP-8251 URL: https://issues.apache.org/jira/browse/HADOOP-8251 Project: Hadoop Common Issue Type: Bug Components: security Affects Versions: 1.1.0, 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Attachments: hadoop-8251.txt HADOOP-6941 replaced direct references to some classes with reflective access so as to support other JDKs. Unfortunately there was a mistake in the name of the Krb5Util class, which broke fetchServiceTicket. This manifests itself as the inability to run checkpoints or other krb5-SSL HTTP-based transfers: java.lang.ClassNotFoundException: sun.security.jgss.krb5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
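Reduced to a minimal example, the bug looks like this (class names per the comment; this is not the literal Hadoop diff):
{code}
// Minimal illustration of the reported bug; not the literal Hadoop diff.
class Krb5UtilLookup {
  static Class<?> load() throws ClassNotFoundException {
    // Broken: this string is only a package name, so it always throws
    //   java.lang.ClassNotFoundException: sun.security.jgss.krb5
    // Class.forName("sun.security.jgss.krb5");

    // Fixed: the fully-qualified class name is required.
    return Class.forName("sun.security.jgss.krb5.Krb5Util");
  }
}
{code}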
[jira] [Commented] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
[ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248019#comment-13248019 ] Todd Lipcon commented on HADOOP-8247: - I should of course note that this is only the first step. After this is committed, the idea is to make the haadmin -failover command line work in coordination with the ZKFC daemons to do a controlled failover. But in the meantime, it's disallowed so that users can't shoot themselves in the foot by running this command. Auto-HA: add a config to enable auto-HA, which disables manual FC - Key: HADOOP-8247 URL: https://issues.apache.org/jira/browse/HADOOP-8247 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8247.txt Currently, if automatic failover is set up and running, and the user uses the haadmin -failover command, he or she can end up putting the system in an inconsistent state, where the state in ZK disagrees with the actual state of the world. To fix this, we should add a config flag which is used to enable auto-HA. When this flag is set, we should disallow use of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag is not set. Of course, this flag should be scoped by nameservice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-7211) Security uses proprietary Sun APIs
[ https://issues.apache.org/jira/browse/HADOOP-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248032#comment-13248032 ] Todd Lipcon commented on HADOOP-7211: - HADOOP-6941 fixed compilation on the IBM JDK using reflection, but added some code which definitely does not work - eg:
{code}
if (System.getProperty("java.vendor").contains("IBM")) {
  principalClass = Class.forName("com.ibm.security.krb5.PrincipalName");
  credentialsClass = Class.forName("com.ibm.security.krb5.Credentials");
  krb5utilClass = Class.forName("com.ibm.security.jgss.mech.krb5");
{code}
but the krb5utilClass here is invalid, and there doesn't appear to be any equivalent in the IBM JDK. Instead of this code which kind of looks like it should work, we should just throw an UnsupportedOperationException until someone actually fixes this. Security uses proprietary Sun APIs -- Key: HADOOP-7211 URL: https://issues.apache.org/jira/browse/HADOOP-7211 Project: Hadoop Common Issue Type: Improvement Components: security Reporter: Eli Collins Assignee: Luke Lu The security code uses the KrbException, Credentials, and PrincipalName classes from sun.security.krb5 and Krb5Util from sun.security.jgss.krb5. These may disappear in future Java releases. Also Hadoop does not compile using jdks that do not support them, for example with the following IBM JDK. {noformat} $ /home/eli/toolchain/java-x86_64-60/bin/java -version java version 1.6.0 Java(TM) SE Runtime Environment (build pxa6460sr9fp1-20110208_03(SR9 FP1)) IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460sr9-20110203_74623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 GC - 20101027_AA) JCL - 20110203_01 {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
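One possible shape for that suggestion, sketched below; the IBM class name is a stand-in, since the comment notes a real equivalent may not exist:
{code}
// Sketch of the "fail loudly" suggestion above. The IBM class name is a
// stand-in, and the point is to surface the gap instead of deferring the failure.
class KrbReflectionSketch {
  static Class<?> loadKrb5Util() {
    String vendor = System.getProperty("java.vendor");
    try {
      if (vendor != null && vendor.contains("IBM")) {
        return Class.forName("com.ibm.security.jgss.mech.krb5.Krb5Util");
      }
      return Class.forName("sun.security.jgss.krb5.Krb5Util");
    } catch (ClassNotFoundException e) {
      throw new UnsupportedOperationException(
          "Kerberos ticket fetching is not supported on this JRE (" + vendor + ")", e);
    }
  }
}
{code}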
[jira] [Commented] (HADOOP-8251) SecurityUtil.fetchServiceTicket broken after HADOOP-6941
[ https://issues.apache.org/jira/browse/HADOOP-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248054#comment-13248054 ] Todd Lipcon commented on HADOOP-8251: - Hey Devaraj. Sorry, I already committed this, and I don't feel comfortable changing the code if I can't test it (I don't have ready access to an IBM JDK installation). I think rather than just fixing this bug, someone should run through the whole security test plan on the IBM JDK -- perhaps as part of HADOOP-7211? bq. Seems like the methods are there and with the desired signatures.. bq. Did I miss something? I was basing it on these docs: http://www.ibm.com/developerworks/java/jdk/security/60/secguides/jgssDocs/api/index.html?com/ibm/security/jgss/mech/krb5/Krb5RealmUtil.html which don't mention krb5util in that package SecurityUtil.fetchServiceTicket broken after HADOOP-6941 Key: HADOOP-8251 URL: https://issues.apache.org/jira/browse/HADOOP-8251 Project: Hadoop Common Issue Type: Bug Components: security Affects Versions: 1.1.0, 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Fix For: 1.0.3, 1.1.0, 2.0.0 Attachments: hadoop-8251-b1.txt, hadoop-8251.txt HADOOP-6941 replaced direct references to some classes with reflective access so as to support other JDKs. Unfortunately there was a mistake in the name of the Krb5Util class, which broke fetchServiceTicket. This manifests itself as the inability to run checkpoints or other krb5-SSL HTTP-based transfers: java.lang.ClassNotFoundException: sun.security.jgss.krb5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8251) SecurityUtil.fetchServiceTicket broken after HADOOP-6941
[ https://issues.apache.org/jira/browse/HADOOP-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248078#comment-13248078 ] Todd Lipcon commented on HADOOP-8251: - bq. Please have at least one simple test that fails without the patch. I'm just fixing what the previous patch broke. I don't have time to write a test, since this depends on security infrastructure, etc, and I can't get that to work right (see my comment on HDFS-3016). The original patch should have had a test, I agree. But my options were to revert that patch, or just fix it, so I did the latter without a test. SecurityUtil.fetchServiceTicket broken after HADOOP-6941 Key: HADOOP-8251 URL: https://issues.apache.org/jira/browse/HADOOP-8251 Project: Hadoop Common Issue Type: Bug Components: security Affects Versions: 1.1.0, 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Blocker Fix For: 1.0.3, 1.1.0, 2.0.0 Attachments: hadoop-8251-b1.txt, hadoop-8251.txt HADOOP-6941 replaced direct references to some classes with reflective access so as to support other JDKs. Unfortunately there was a mistake in the name of the Krb5Util class, which broke fetchServiceTicket. This manifests itself as the inability to run checkpoints or other krb5-SSL HTTP-based transfers: java.lang.ClassNotFoundException: sun.security.jgss.krb5 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-7211) Security uses proprietary Sun APIs
[ https://issues.apache.org/jira/browse/HADOOP-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248081#comment-13248081 ] Todd Lipcon commented on HADOOP-7211: - bq. This jira is incorporated by the patches in HADOOP-6941 and HADOOP-7211 Did you mean another JIRA? This _is_ HADOOP-7211 Security uses proprietary Sun APIs -- Key: HADOOP-7211 URL: https://issues.apache.org/jira/browse/HADOOP-7211 Project: Hadoop Common Issue Type: Improvement Components: security Reporter: Eli Collins Assignee: Luke Lu The security code uses the KrbException, Credentials, and PrincipalName classes from sun.security.krb5 and Krb5Util from sun.security.jgss.krb5. These may disappear in future Java releases. Also Hadoop does not compile using jdks that do not support them, for example with the following IBM JDK. {noformat} $ /home/eli/toolchain/java-x86_64-60/bin/java -version java version 1.6.0 Java(TM) SE Runtime Environment (build pxa6460sr9fp1-20110208_03(SR9 FP1)) IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460sr9-20110203_74623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 GC - 20101027_AA) JCL - 20110203_01 {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-7211) Security uses proprietary Sun APIs
[ https://issues.apache.org/jira/browse/HADOOP-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248084#comment-13248084 ] Todd Lipcon commented on HADOOP-7211: - I don't think this JIRA should be marked as duplicate, because clearly HADOOP-6941 wasn't thoroughly tested. As soon as I tried running a cluster with a 2NN I found that it didn't work. So I'm skeptical that there isn't more work to do... Security uses proprietary Sun APIs -- Key: HADOOP-7211 URL: https://issues.apache.org/jira/browse/HADOOP-7211 Project: Hadoop Common Issue Type: Improvement Components: security Reporter: Eli Collins Assignee: Luke Lu The security code uses the KrbException, Credentials, and PrincipalName classes from sun.security.krb5 and Krb5Util from sun.security.jgss.krb5. These may disappear in future Java releases. Also Hadoop does not compile using jdks that do not support them, for example with the following IBM JDK. {noformat} $ /home/eli/toolchain/java-x86_64-60/bin/java -version java version 1.6.0 Java(TM) SE Runtime Environment (build pxa6460sr9fp1-20110208_03(SR9 FP1)) IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 jvmxa6460sr9-20110203_74623 (JIT enabled, AOT enabled) J9VM - 20110203_074623 JIT - r9_20101028_17488ifx3 GC - 20101027_AA) JCL - 20110203_01 {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8007) HA: use substitution token for fencing argument
[ https://issues.apache.org/jira/browse/HADOOP-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13246113#comment-13246113 ] Todd Lipcon commented on HADOOP-8007: - bq. org.apache.hadoop.ha.TestZKFailoverController This failure was the JMXEnv issue tracked in HADOOP-8245. I will commit this momentarily HA: use substitution token for fencing argument --- Key: HADOOP-8007 URL: https://issues.apache.org/jira/browse/HADOOP-8007 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 2.0.0 Reporter: Aaron T. Myers Assignee: Todd Lipcon Attachments: hadoop-8007.txt, hadoop-8007.txt Per HADOOP-7983 currently the fencer always passes the target host:port to fence as the first argument to the fence script, it would be better to use a substitution token. That is to say, the user would configure myfence.sh $TARGETHOST foo bar and Hadoop would substitute the target. This would allow use of pre-existing scripts that might have a different ordering of arguments without a wrapper. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8215) Security support for ZK Failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245892#comment-13245892 ] Todd Lipcon commented on HADOOP-8215: - I'll commit this momentarily to the branch based on ATM's above +1, since the review feedback changes were mostly cosmetic. I ran the ZKFC and HAAdmin tests locally for both common and HDFS and they passed. Security support for ZK Failover controller --- Key: HADOOP-8215 URL: https://issues.apache.org/jira/browse/HADOOP-8215 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Fix For: Auto Failover (HDFS-3042) Attachments: hadoop-8215.txt, hadoop-8215.txt To keep the initial patches manageable, kerberos security is not currently supported in the ZKFC implementation. This JIRA is to support the following important pieces for security: - integrate with ZK authentication (kerberos or password-based) - allow the user to configure ACLs for the relevant znodes - add keytab configuration and login to the ZKFC daemons - ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8245) Fix flakiness in TestZKFailoverController
[ https://issues.apache.org/jira/browse/HADOOP-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245991#comment-13245991 ] Todd Lipcon commented on HADOOP-8245: - For problem #1, the solution is the same as is already done in some other test cases. We just need to add a workaround to clear the ZK MBeans before running the tearDown method. It's a hack, but in the absence of a fix for ZOOKEEPER-1438, it's about all we can do. I spent some time investigating problem #2. The bug is as follows: - these test cases create a new ActiveStandbyElector, and call {{ActiveStandbyElector.ensureBaseNode()}} on it before running the main body of the tests. Although they don't call {{joinElection()}}, the creation of the elector does create a {{zkClient}} object with an associated Watcher. - in the {{testZookeeperFailure}} test case, we shut down and restart ZK. This causes the above Watcher instance to fire its Disconnected and then Connected events. There was a bug in the handling of the Connected event that would cause it to re-monitor the lock znode regardless of whether it was previously in the election. - So, when ZK comes back up, there were not two but *three* electors racing for the lock. However, two of the electors actually corresponded to the same dummy service. In some cases this race would be resolved in such a way that the test timed out. I don't think this is a problem in practice, since the formatZK call runs in its own JVM in the current code. However, it's worth fixing to get the tests to not be flaky, and to have a more reasonable behavior. There are several fixes to be done: - Add extra asserts for {{wantToBeInElection}} to catch cases where we might accidentally re-join the election when we weren't supposed to be in it. - Fix the handling of the Connected event to only re-join if the elector wants to be in the election - Cause exceptions thrown by watcher callbacks to be propagated back as fatal errors Will post a patch momentarily. Fix flakiness in TestZKFailoverController - Key: HADOOP-8245 URL: https://issues.apache.org/jira/browse/HADOOP-8245 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor When I loop TestZKFailoverController, I occasionally see two types of failures: 1) the ZK JMXEnv issue (ZOOKEEPER-1438) 2) TestZKFailoverController.testZooKeeperFailure fails with a timeout This JIRA is for fixes for these issues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
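For reference, a minimal sketch of the "clear the ZK MBeans before tearDown" workaround described above; the JMX domain pattern and class name are assumptions, not necessarily what the committed patch uses:
{code}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class ZKMBeanCleanerSketch {
  // Unregister any MBeans the embedded ZooKeeper server left behind, so the
  // JMXEnv verification in the superclass tearDown() does not fail the test.
  static void clearZooKeeperMBeans() throws Exception {
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    for (ObjectName name :
        mbs.queryNames(new ObjectName("org.apache.ZooKeeperService:*"), null)) {
      mbs.unregisterMBean(name);
    }
  }
}
{code}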
[jira] [Commented] (HADOOP-8210) Common side of HDFS-3148
[ https://issues.apache.org/jira/browse/HADOOP-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244403#comment-13244403 ] Todd Lipcon commented on HADOOP-8210: - +1 Common side of HDFS-3148 Key: HADOOP-8210 URL: https://issues.apache.org/jira/browse/HADOOP-8210 Project: Hadoop Common Issue Type: Sub-task Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8210.txt, hadoop-8210.txt Common side of HDFS-3148, add necessary DNS and NetUtils methods. Test coverage is in the HDFS-3148 patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8243) Security support broken in CLI (manual) failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244625#comment-13244625 ] Todd Lipcon commented on HADOOP-8243: - I should note I also ran TestDFSHAAdmin and TestDFSHAAdminMiniCluster against this common patch, and they both passed. Security support broken in CLI (manual) failover controller --- Key: HADOOP-8243 URL: https://issues.apache.org/jira/browse/HADOOP-8243 Project: Hadoop Common Issue Type: Bug Components: ha, security Affects Versions: 2.0.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8243.txt Some recent refactoring accidentally caused the proxies in some places to get created with a default Configuration, instead of using the Configuration set up by the DFSHAAdmin tool. This causes the HAServiceProtocol to be missing the configuration which specifies the NN principle -- and thus breaks the CLI HAAdmin tool in secure setups. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8215) Security support for ZK Failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244654#comment-13244654 ] Todd Lipcon commented on HADOOP-8215: - I'm starting to work on this. Here's the plan: bq. integrate with ZK authentication (kerberos or password-based) Based on https://github.com/ekoontz/zookeeper/wiki and http://hbase.apache.org/configuration.html#zk.sasl.auth it looks like the SASL setup is a bit complicated, though entirely configuration based. I think for a first pass we should be OK to just use password-based authentication for ZK. I think this is sufficient because we have a well-defined set of clients that need to access these znodes, and they don't contain any content that needs to be encrypted over the wire. We can add SASL support later. bq. allow the user to configure ACLs for the relevant znodes This is reasonably straightforward - just needs some additional configuration keys to specify the ACL, and then tying it in to where we create the znodes. bq. add keytab configuration and login to the ZKFC daemons I think it should be OK to re-use the namenode principals here. That simplifies deployment since it avoids having to add new principals to the KDC, and given that the ZKFCs are intended to run on the same machines as the NNs, they will have access to the keytab files by default. Please speak up if you think we need separate keytabs/principals for the ZKFC daemons. bq. ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons This is just a matter of making sure we set up the target principal in the Configuration, and do the proper login/doAs before we start the main ZKFC code. Security support for ZK Failover controller --- Key: HADOOP-8215 URL: https://issues.apache.org/jira/browse/HADOOP-8215 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical To keep the initial patches manageable, kerberos security is not currently supported in the ZKFC implementation. This JIRA is to support the following important pieces for security: - integrate with ZK authentication (kerberos or password-based) - allow the user to configure ACLs for the relevant znodes - add keytab configuration and login to the ZKFC daemons - ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
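A rough sketch of the digest-based ZK authentication and znode ACL setup outlined in the first two bullets above (the user:password value and znode path are placeholders, and this is not the committed ZKFC code):
{code}
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;
import org.apache.zookeeper.server.auth.DigestAuthenticationProvider;

class ZKDigestAclSketch {
  static void createProtectedNode(ZooKeeper zk) throws Exception {
    // Authenticate this session with the configured user:password pair.
    zk.addAuthInfo("digest", "foo:testing".getBytes());
    // Build an ACL that only the same digest identity can read/write/administer.
    String digest = DigestAuthenticationProvider.generateDigest("foo:testing");
    List<ACL> acls = Collections.singletonList(
        new ACL(ZooDefs.Perms.ALL, new Id("digest", digest)));
    zk.create("/hadoop-ha", new byte[0], acls, CreateMode.PERSISTENT);
  }
}
{code}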
[jira] [Commented] (HADOOP-8215) Security support for ZK Failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244875#comment-13244875 ] Todd Lipcon commented on HADOOP-8215: - Because coverage of security is hard to automate, I performed the following manual test steps to verify this patch on a secure cluster: - Set up two NNs with kerberos security enabled - Use ZK command line to generate digest credentials: {code} todd@todd-w510:~/releases/zookeeper-3.4.1-cdh4b1$ java -cp lib/*:zookeeper-3.4.1-cdh4b1.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider foo:testing foo:testing->foo:vlUvLnd8MlacsE80rDuu6ONESbM= {code} Add these two to the HDFS configuration: {code} <property> <name>ha.zookeeper.acl</name> <value>digest:foo:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda</value> </property> <property> <name>ha.zookeeper.auth</name> <value>digest:foo:testing</value> </property> {code} - Run bin/hdfs zkfc -formatZK - Run bin/hdfs zkfc for each NN - Run bin/hdfs namenode for each NN - Verify that one of the NNs becomes active. Kill that NN. Verify that the other NN becomes active within a few seconds. - Verify authentication results in the NN logs: {code} 12/04/02 17:25:22 INFO authorize.ServiceAuthorizationManager: Authorization successfull for hdfs-todd/todd-w...@hadoop.com (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol {code} - Use ZK CLI to verify the acls: {code} [zk: localhost:2181(CONNECTED) 1] addauth digest foo:testing [zk: localhost:2181(CONNECTED) 2] ls /hadoop-ha [ActiveBreadCrumb, ActiveStandbyElectorLock] [zk: localhost:2181(CONNECTED) 3] getAcl /hadoop-ha 'digest,'foo:vlUvLnd8MlacsE80rDuu6ONESbM= : cdrwa [zk: localhost:2181(CONNECTED) 4] getAcl /hadoop-ha/ActiveBreadCrumb 'digest,'foo:vlUvLnd8MlacsE80rDuu6ONESbM= : cdrwa {code} - Shut down nodes, replace configuration with indirect version: {code} <property> <name>ha.zookeeper.acl</name> <value>@/home/todd/confs/devconf.ha.common/zk-acl.txt</value> </property> <property> <name>ha.zookeeper.auth</name> <value>@/home/todd/confs/devconf.ha.common/zk-auth.txt</value> </property> {code} and move the actual values to the files as specified above - Restart ZKFCs, verify that the ACLs are still being correctly used - chmod 000 the ACL data so it's no longer readable, try to restart one of the ZKFCs, verify error: {code} Exception in thread "main" java.io.FileNotFoundException: /home/todd/confs/devconf.ha.common/zk-acl.txt (Permission denied) {code} Security support for ZK Failover controller --- Key: HADOOP-8215 URL: https://issues.apache.org/jira/browse/HADOOP-8215 Project: Hadoop Common Issue Type: Improvement Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8215.txt To keep the initial patches manageable, kerberos security is not currently supported in the ZKFC implementation. This JIRA is to support the following important pieces for security: - integrate with ZK authentication (kerberos or password-based) - allow the user to configure ACLs for the relevant znodes - add keytab configuration and login to the ZKFC daemons - ensure that the RPCs made by the health monitor and failover controller properly authenticate to the target daemons -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8211) Update commons-net version to 3.1
[ https://issues.apache.org/jira/browse/HADOOP-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243587#comment-13243587 ] Todd Lipcon commented on HADOOP-8211: - +1, assuming you've done a full build locally and run ftpfs-related tests. (are there any such? I can't seem to find any, since HDFS-441 removed it from HDFS but HADOOP-6119 never re-committed it in Common) Update commons-net version to 3.1 - Key: HADOOP-8211 URL: https://issues.apache.org/jira/browse/HADOOP-8211 Project: Hadoop Common Issue Type: Sub-task Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8211.txt HADOOP-8210 requires the commons-net version be upgraded. Let's bump it to the latest stable version. The only other user is FtpFs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8210) Common side of HDFS-3148
[ https://issues.apache.org/jira/browse/HADOOP-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243597#comment-13243597 ] Todd Lipcon commented on HADOOP-8210: - {code} +LinkedHashSet<InetAddress> addrs = new LinkedHashSet<InetAddress>(); {code} I think it's worth changing the return type of this function to LinkedHashSet, so it's clear that the ordering here is on purpose. Perhaps also add a comment here saying something like: {code} // See below for reasoning behind using an ordered set. {code} {code} +// that depend on a particular element being 1st in the array. +// Eg. getDefaultIP always returns the 1st element. {code} Nits: please un-abbreviate "1st" to "first" for better readability. Also, "e.g." instead of "Eg." -- or just say "For example" {code} + ips[i] = addr.getHostAddress(); + i++; {code} I think it's more idiomatic to just put the postincrement inside the []s - there's a small spurious whitespace change in NetUtils.java - looks like the pom change is still in this patch (redundant with HADOOP-8211) Common side of HDFS-3148 Key: HADOOP-8210 URL: https://issues.apache.org/jira/browse/HADOOP-8210 Project: Hadoop Common Issue Type: Sub-task Components: io, performance Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8210.txt Common side of HDFS-3148, add necessary DNS and NetUtils methods. Test coverage is in the HDFS-3148 patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
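A small sketch of the return-type suggestion above; the method name and signature are illustrative, not the actual DNS/NetUtils code under review:
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashSet;

class OrderedResolveSketch {
  // Returning LinkedHashSet (rather than plain Set) documents that insertion order
  // matters: callers such as a getDefaultIP()-style method treat the first element
  // specially.
  static LinkedHashSet<InetAddress> resolveOrdered(String... hosts)
      throws UnknownHostException {
    LinkedHashSet<InetAddress> addrs = new LinkedHashSet<InetAddress>();
    for (String host : hosts) {
      addrs.add(InetAddress.getByName(host));
    }
    return addrs;
  }
}
{code}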
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242664#comment-13242664 ] Todd Lipcon commented on HADOOP-8220: - bq. Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable? Currently, ActiveStandbyElector doesn't take a Configuration object. I think many of the parameters should be changed to be configured via Configuration, but I didn't want to make this into a bigger scoped change. bq. There's some inconsistency in capitalization between reJoinElection and rejoinElectionAfterFailureToBecomeActive Changed to consistently use reJoin to match the previously existing code. bq. Might want to do a s/System.currentTimeMillis/Util.now/g The {{Util}} class is in HDFS, but this code is in common. We don't seem to have an equivalent in common. bq. Any reason we shouldn't make LOG_INTERVAL_MS configurable? It's just test code, so seemed unnecessary. bq. Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static. bq. Considering the comment says after sleeping for a short period in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect. Done. I made the overridden method keep a tally of number of slept millis, and added asserts to the tests to make sure it slept for some time when rejoining. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
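A minimal sketch of the sleepFor() hook and the test override discussed above; the class names and bodies are illustrative rather than the actual patch:
{code}
import com.google.common.annotations.VisibleForTesting;

class ElectorSleepSketch {
  // Deliberately an instance method (not static): tests subclass and override it
  // so they can observe the sleep without actually waiting.
  @VisibleForTesting
  protected void sleepFor(int sleepMs) {
    try {
      Thread.sleep(sleepMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

class TestElectorSleepSketch extends ElectorSleepSketch {
  long sleptMs = 0;
  @Override
  protected void sleepFor(int sleepMs) {
    sleptMs += sleepMs;  // tally instead of sleeping; assert sleptMs > 0 after a rejoin
  }
}
{code}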
[jira] [Commented] (HADOOP-8228) Auto HA: Refactor tests and add stress tests
[ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242677#comment-13242677 ] Todd Lipcon commented on HADOOP-8228: - bq. One question: are you positive that the ordering of the two @After methods either doesn't matter, or is guaranteed to happen in the right order? The order of the two @After methods is nondeterministic. But, in this case, it's only important that our @After method runs before the superclass (ClientBase)'s tearDown. JUnit does guarantee the ordering in this case. bq. One comment: maybe use a deterministic random seed for the Random instances you're using? Or at least log the amount of time that the test is sleeping for and what it's throwing? Good point. I added additional logging for when it throws exceptions, and for when it expires sessions. I don't think the deterministic seed helps things, since the interleaving is still non-deterministic (that's part of the value of these tests :) ) Auto HA: Refactor tests and add stress tests Key: HADOOP-8228 URL: https://issues.apache.org/jira/browse/HADOOP-8228 Project: Hadoop Common Issue Type: Test Components: auto-failover, ha, test Affects Versions: Auto Failover (HDFS-3042) Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8228.txt, hadoop-8228.txt, hadoop-8228.txt It's important that the ZKFailoverController be robust and not contain race conditions, etc. One strategy to find potential races is to add stress tests which exercise the code as fast as possible. This JIRA is to implement some test cases of this style. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
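For reference, a minimal illustration of the @After ordering relied on above (class and method names are made up, not the actual test code):
{code}
import org.junit.After;

class BaseZkTestSketch {
  @After
  public void tearDown() throws Exception {
    // superclass cleanup (e.g. ClientBase's ZK server / JMXEnv checks) runs LAST
  }
}

class StressTestSketch extends BaseZkTestSketch {
  @After
  public void stopElectors() throws Exception {
    // subclass cleanup runs FIRST -- JUnit 4 invokes subclass @After methods
    // before superclass @After methods, which is the ordering relied on above
  }
}
{code}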
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242855#comment-13242855 ] Todd Lipcon commented on HADOOP-8217: - bq. 3. ZKFC2 tries to do transitionToStandby() on NN1. RPC times out. bq. 4. Don't know what happens now in your design As has been the case in all of the HA work up to and including this point, it initiates the fence method at this point. The fence method has to do persistent fencing of the shared resource (eg. disable access to the SAN or STONITH the node). Please refer to the code in which I think this is fairly clear. The solution here is to improve the ability to do failover when graceful fencing suffices. In many failover cases it's preferable to _not_ have to invoke STONITH or storage fencing, since those mechanisms will often require administrative intervention to un-fence. bq. Given, the above, how will NN1 receive the zxid from ZKFC2? If it does not then the solution is invalid. Hari's scenario exemplifies this. All transitionToActive/transitionToStandby calls would include the zxid. So, the sequence becomes: 1. ZKFC1 gets active lock (zxid=1) 2. ZKFC1 is about to send transitionToActive(1) and machine freezes (eg GC pause + swapping) 3. ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock (zxid=2) 4. ZKFC2 calls NN1.transitionToStandby(2) and NN2.transitionToActive(2). 5. ZKFC1 wakes up from pause, calls NN1.transitionToActive(1). NN1 rejects the request because it previously accepted zxid=2 in step 4 above. or the failure case: 4(failure case): if NN1.transitionToStandby() times out or fails, the non-graceful fencing is initiated (same as in existing HA code for the last several months) 5(failure case with storage fencing): ZKFC1 wakes up from pause, and calls NN1.transitionToActive(1). NN1 tries to access the shared edits storage and fails, because it has been fenced. So, there is no split-brain. 5(failure case with STONITH): ZKFC1 never wakes up from pause, because its power plug has been pulled. So, there is no split-brain. Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8217-testcase.txt As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
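A hedged sketch of the "reject older requests" rule in the scenario above; the field and method shown are illustrative and not the actual HAServiceProtocol signatures:
{code}
import java.io.IOException;

class ZxidFencingSketch {
  private long lastAcceptedZxid = Long.MIN_VALUE;

  // Each transition request carries the zxid under which the caller won the lock.
  // A request older than one we already accepted is rejected, so a ZKFC waking up
  // from a long pause cannot re-activate the node.
  synchronized void transitionToActive(long requestZxid) throws IOException {
    if (requestZxid < lastAcceptedZxid) {
      throw new IOException("Stale failover request: zxid " + requestZxid
          + " is older than already-accepted zxid " + lastAcceptedZxid);
    }
    lastAcceptedZxid = requestZxid;
    // ... actually transition to active here ...
  }
}
{code}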
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242880#comment-13242880 ] Todd Lipcon commented on HADOOP-8217: - bq. Can you please point me to the existing HA code for last several months? I thought we have manual HA in which admin does fencing. See HDFS-2179 (committed last August), which added the fencing code, and HADOOP-7938, which added the fencing behavior to the manual failover controller (committed in January). The HA guide ({{hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm}}) also details the configuration and operation of the fencing: {quote} * failover - initiate a failover between two NameNodes This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned. {quote} Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8217-testcase.txt As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242895#comment-13242895 ] Todd Lipcon commented on HADOOP-8202: - This patch also introduced the following bug: if the proxy.close() function throws an IOException, then HadoopIllegalArgumentException will be thrown, claiming that the proxy doesn't implement Closeable. This is the wrong error to throw, and is a regression in behavior (failure to close due to IOE should just be a warning, as it was previously). Hari, would you mind fixing this? stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Fix For: 2.0.0 Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202-4.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
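A sketch of the close handling being asked for above (illustrative, not the committed RPC.stopProxy() body; LOG here is just a commons-logging logger for the sketch class):
{code}
import java.io.Closeable;
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.HadoopIllegalArgumentException;

class StopProxySketch {
  private static final Log LOG = LogFactory.getLog(StopProxySketch.class);

  static void stopProxy(Object proxy) {
    if (proxy instanceof Closeable) {
      try {
        ((Closeable) proxy).close();
      } catch (IOException e) {
        // A failure to close is only worth a warning -- it must not be reported
        // as "proxy is not Closeable".
        LOG.warn("Exception closing proxy " + proxy, e);
      }
    } else {
      throw new HadoopIllegalArgumentException(
          "Cannot close proxy - is not Closeable: " + proxy);
    }
  }
}
{code}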
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242944#comment-13242944 ] Todd Lipcon commented on HADOOP-8217: - bq. I would like to question the value of FC2 calling NN1.transitionToStandby() in general. FC1 on NN1 is supposed to call NN1.transitionToStandby() because thats is FC1's responsibility upon losing the leader lock. This doesn't work, since FC1 can take arbitrarily long to notice that it has lost its lock. bq. Secondly, based on the recent work done to add breadcrumbs to the ActiveStandbyElector, FC2 is going to fence NN1 if NN1 has not gracefully given up the lock, which is clearly the case here. So the problem is already solved unless I am mistaken. But the first stage of fencing is to gracefully ask the NN to go to standby. This is exactly the problem here. If, instead, we always required that we always use an aggressive fencing mechanism (STONITH/NAS fencing), you're right that there would not be a problem. But we can avoid that in many cases -- for example, imagine that the active node loses its connection to the ZK quorum, but still has a connection to the other NN (eg by a crossover cable). In this case it will leave its breadcrumb znode there, but the new active can easily transition it to standby. Here's another way of looking at this JIRA: - the aggressive fencing mechanisms have the property of being persistent. i.e after fencing, the node cannot become active, even if asked to. - the graceful fencing mechanism (transitionToStandby() RPC) does not currently have the property of being persistent. If another older node asks it to become active after it's been gracefully fenced, it will do so incorrectly. - This JIRA makes graceful fencing persistent, so it can be used correctly. Regarding the ActiveStandbyElector callback for {{becomeStandby}}, I actually think it's redundant. There are two cases in which it could be called: - If already standby, it's a no-op - If active, then this indicates that the elector lost its znode. Since it lost its znode (rather than quitting the election gracefully), it will leave its breadcrumb behind. Thus, the other node will fence it. So, calling transitionToStandby is redundant with fencing which the other node will have to perform anyway. Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8217-testcase.txt As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8212) Improve ActiveStandbyElector's behavior when session expires
[ https://issues.apache.org/jira/browse/HADOOP-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241832#comment-13241832 ] Todd Lipcon commented on HADOOP-8212: - Thanks for reviewing the addendum, and for your comments. I'll commit the addendum to the new HDFS-3042 branch momentarily. Improve ActiveStandbyElector's behavior when session expires Key: HADOOP-8212 URL: https://issues.apache.org/jira/browse/HADOOP-8212 Project: Hadoop Common Issue Type: Improvement Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: Auto Failover (HDFS-3042) Attachments: hadoop-8212-delta-bikas.txt, hadoop-8212.txt, hadoop-8212.txt Currently when the ZK session expires, it results in a fatal error being sent to the application callback. This is not the best behavior -- for example, in the case of HA, if ZK goes down, we would like the current state to be maintained, rather than causing either NN to abort. When the ZK clients are able to reconnect, they should sort out the correct leader based on the normal locking schemes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241935#comment-13241935 ] Todd Lipcon commented on HADOOP-8220: - Actually moving the error handling code to the call site (instead of inside becomeActive()) introduced a bug, since we call becomeActive() from another spot as well, in the StatCallback. So we need to have similar code there, or move the error handling back up into becomeActive() ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240237#comment-13240237 ] Todd Lipcon commented on HADOOP-8217: - Suresh: we've already had a meeting ostensibly for this purpose, I think. There is also a design document posted to HDFS-2185. The document doesn't include every possible scenario, because I don't have infinite foresight. I don't think having meetings or more reviews of the design doc will help that. For example, with the original manual-failover project, we had several design meetings as well as a design document posted on HDFS-1623. Looking back at that project, the design document captured the overall idea (like the HDFS-2185 one does here) but did not foresee some of the trickiest issues we dealt with during implementation (for example, how to deal with invalidations with regard to datanode fencing, how to handle safe mode, how to deal with delegation tokens, etc). In that project, as we came upon each new scenario to deal with, we opened a JIRA and had a discussion on the design solution for that particular scenario. I don't see why we can't do the same here. Nor do I see why we are likely to be able to foresee all the corner cases a priori here better than we were able to with HDFS-1623. So, I am not going to pause work to wait for meetings or more design discussion. If you see problems with the design, please comment on the design doc on HDFS-2185, or on the individual JIRAs which seem to have problems. I'm happy to address them, even after commit (eg I'm currently addressing Bikas's review comments on HADOOP-8212) Since there seems to be concern that we are moving too fast, I will create an auto-failover branch later tonight to continue working on implementing this design. I'll also create a new auto-failover component on JIRA so it's easier to follow them. If you have concerns about the implementation or the design when it comes time to merge it, please do vote against the merge, voicing whatever objections you might have. And please do comment along the way if you see issues. Thanks. Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240776#comment-13240776 ] Todd Lipcon commented on HADOOP-8220: - Yep, your updated description of the tight loop is exactly right. Sorry, I didn't note the fact that becomeActive() throws an exception in this scenario. New draft of the patch attached. - Added a true unit test for the new changes, in addition to the functional test from the prior revision (TestActiveStandbyElector#testFailToBecomeActive) - Change the control flow so that the success and error cases are kept near each other (suggested by Bikas above) - Changed the sleep calls to be wrapped in a {{sleepFor(ms)}} function, so it's easy to disable the sleeping behavior in the unit tests. Otherwise the tests ran longer for no good reason. In response to a couple comments above that got lost in the discussion: {quote} 2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock. {quote} This is really difficult to do reliably, since there's no good way to 'cancel' the callback. The {{transitionToActive}} RPC itself should have a timeout attached -- it's much more straightforward to do that than to try to make ActiveStandbyElector guard against arbitrary code running too long in the callback. I added a note to the javadoc indicating this. {quote} Do you really want to commit the logs added to ActiveStandbyTestUtil? {quote} Yes, I found that when I had a test failure due to timeout, it was difficult to debug, since I couldn't easily tell which node had the lock at the time the test timed out. I rate-limited the logging to only two per second, so it shouldn't make the logs too noisy, while retaining the advantage of seeing what's going on better if there is a timeout. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: auto-failover, ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt, hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8212) Improve ActiveStandbyElector's behavior when session expires
[ https://issues.apache.org/jira/browse/HADOOP-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239631#comment-13239631 ] Todd Lipcon commented on HADOOP-8212: - bq. I think we want to added similar handling in the StatCallback. Its another race waiting to happen. The patch does add the same handling to StatCallback. It uses the ZooKeeper context parameter to pass the original zkClient. Unfortunately the Watcher interface doesn't have any context object, which is why I had to introduce the wrapper class there. bq. The comment on processWatchEvent needs to change slightly to reflect that its the proxied watcher callback handler. Does the following look good? {code} - * interface implementation of Zookeeper watch events (connection and node) + * interface implementation of Zookeeper watch events (connection and node), + * proxied by {@link WatcherWithClientRef}. {code} bq. Whats the hurry? In my experience working on similar projects in the past, getting all the initial code in place is only half the battle. The real work starts once the code is there and you start banging on it in realistic test scenarios. We'd like to see automatic failover be a supported piece of the HA solution in 0.23.x (..err..2.0), and to hit that timeline, we need to get into the latter phase ASAP. I'm less aggressive when it comes to changing existing code, but since this is all new code, there's no risk of regressing working features by moving fast here. Once it starts to stabilize we can afford to slow down the rate of change. If you'd prefer, I'm happy to create a feature branch for auto-failover and then call a merge vote when it's ready for the full QA onslaught. Improve ActiveStandbyElector's behavior when session expires Key: HADOOP-8212 URL: https://issues.apache.org/jira/browse/HADOOP-8212 Project: Hadoop Common Issue Type: Improvement Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.23.3, 0.24.0 Attachments: hadoop-8212.txt, hadoop-8212.txt Currently when the ZK session expires, it results in a fatal error being sent to the application callback. This is not the best behavior -- for example, in the case of HA, if ZK goes down, we would like the current state to be maintained, rather than causing either NN to abort. When the ZK clients are able to reconnect, they should sort out the correct leader based on the normal locking schemes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
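A rough sketch of the WatcherWithClientRef wrapper referenced in the diff above; the handler signature is illustrative, not the exact code in the patch:
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class WatcherWithClientRefSketch implements Watcher {
  // ZooKeeper's Watcher callback carries no context argument, so the wrapper captures
  // the client it was registered for and passes it along, letting the elector ignore
  // events that arrive from a stale (already-replaced) client.
  private final ZooKeeper clientRef;

  WatcherWithClientRefSketch(ZooKeeper clientRef) {
    this.clientRef = clientRef;
  }

  @Override
  public void process(WatchedEvent event) {
    processWatchEvent(clientRef, event);
  }

  private void processWatchEvent(ZooKeeper source, WatchedEvent event) {
    // the real handler would compare 'source' against the current zkClient
  }
}
{code}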
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239641#comment-13239641 ] Todd Lipcon commented on HADOOP-8220: - I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via the integration test, but you're right that adding to the unit tests makes sense too. bq. How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about? The NPE occurred in the previous code because we had the following sequence: - createNode succeeded - called ZKFC becomeActive() callback -- becomeActive() throws exception -- ZKFC had a catch() clause which called quitElection () (it turned out this wasn't the right behavior) --- quitElection() nulled out zkClient - ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been nulled out. The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics. bq. What is the purpose of adding the sleep? Could you please elaborate? Without the sleep, it will do a tight loop retrying to become active. This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active. I will add a comment to this effect in the code. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239685#comment-13239685 ] Todd Lipcon commented on HADOOP-8218: - I'm fine with that, too. Suresh/Tom? Pick your patch, I'll do it. I just want to get something committed today to fix the failing tests. RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8218.txt, hadoop-8218.txt HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239957#comment-13239957 ] Todd Lipcon commented on HADOOP-8220: - bq. Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions? You got it. Exceptions are generally preferred for cases like this -- since we have to handle the error condition regardless of whether it's a usual error or whether it was something like a NPE or other truly exceptional condition. So even with a boolean return type, we'd need a try/catch clause. Does that make sense? (I also had originally made it return boolean but then changed it to an exception) bq. I did not understand where the tight loop is? Do you mean (Elector gets lock-ZKFC fails to becomes active)? Yep. In my test I saw that the standby would retry in a tight loop like that: # Succeed in getting lock # Call becomeActive() # drop ZK session (lock disappears) # reconnect to ZK # Goto 1 I simply inserted a sleep between dropping the connection and reconnecting. This gives the old active a better chance to become active again (or if there is a third node in the future, it would have a chance to take the lock). In the future we may want to add some random jitter and exponential backoff, but at this point let's keep it simple. ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240162#comment-13240162 ] Todd Lipcon commented on HADOOP-8218: - Since the patch is up, and people seem OK with it, I'll commit the version Tom suggested (the latter patch) RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8218.txt, hadoop-8218.txt HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238125#comment-13238125 ] Todd Lipcon commented on HADOOP-8202: - Instead of adding the instanceof check anywhere we use an object that might be a mock, can we instead change the protocol interfaces themselves to extend Closeable? That will make sure that any proxy implementations themselves take care of extending it, and also will solve the mock issue (since the mock itself will then also extend Closeable). stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
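To illustrate the suggestion above, a minimal sketch; FooProtocol is a made-up stand-in for a real IPC protocol interface:
{code}
import java.io.Closeable;
import java.io.IOException;

// FooProtocol is hypothetical. Once the protocol interface itself extends Closeable,
// both real RPC proxies and Mockito mocks of it can be closed by RPC.stopProxy()
// without instanceof checks at the call sites.
public interface FooProtocol extends Closeable {
  long getProtocolVersion(String protocol, long clientVersion) throws IOException;
}
{code}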
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238597#comment-13238597 ] Todd Lipcon commented on HADOOP-8202: - Sure, go ahead and commit. Thanks. stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8131) FsShell put doesn't correctly handle a non-existent dir
[ https://issues.apache.org/jira/browse/HADOOP-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238646#comment-13238646 ] Todd Lipcon commented on HADOOP-8131: - If it's not too huge a pain, I'd be in favor of a deprecated config flag which restores the old behavior (while emitting a warning that it's deprecated and to be removed in a future version). This will help people migrate to 0.23, since I'm sure there are lots of cases where people have shell scripts running as part of production workflows. FsShell put doesn't correctly handle a non-existent dir --- Key: HADOOP-8131 URL: https://issues.apache.org/jira/browse/HADOOP-8131 Project: Hadoop Common Issue Type: Bug Affects Versions: 0.23.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Fix For: 0.23.2 Attachments: HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch {noformat} $ hadoop fs -ls ls: `.': No such file or directory $ hadoop fs -put file $ hadoop fs -ls Found 1 items -rw-r--r-- 1 kihwal supergroup 2076 2011-11-04 10:37 .._COPYING_ {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8131) FsShell put doesn't correctly handle a non-existent dir
[ https://issues.apache.org/jira/browse/HADOOP-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238677#comment-13238677 ] Todd Lipcon commented on HADOOP-8131: - bq. Just to confirm: you mean a real conf key, not a cmdline flag, right? Yep - something that could be set system-wide in core-site.xml. When users upgrade, they expect they may have to tweak some confs for the new version, but it's harder to ask them to change all of their shell scripts. bq. In either case it will be a change now or change later scenario Right. The idea is that they would have some warning (a full major version) before their code stops working. Our general policy is to only make the breaking change after having the deprecated support for a full major version -- in which case it would go away in 0.24.0. bq. Would this bring back the issue of left out _temporary dirs? (MAPREDUCE-1272) I would think the MR task would be using the new non-deprecated API which doesn't recursively create parents. FsShell put doesn't correctly handle a non-existent dir --- Key: HADOOP-8131 URL: https://issues.apache.org/jira/browse/HADOOP-8131 Project: Hadoop Common Issue Type: Bug Affects Versions: 0.23.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Fix For: 0.23.2 Attachments: HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch, HADOOP-8131.patch {noformat} $ hadoop fs -ls ls: `.': No such file or directory $ hadoop fs -put file $ hadoop fs -ls Found 1 items -rw-r--r-- 1 kihwal supergroup 2076 2011-11-04 10:37 .._COPYING_ {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
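A sketch of what such a compatibility flag might look like; the key name below is invented for illustration and is not an actual Hadoop property:
{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical key name, for illustration only.
public class LegacyPutBehavior {
  static final String LEGACY_KEY = "fs.shell.put.create.missing.parents";

  // Returns true if the old recursive-parent-creation behavior should be used,
  // warning that the escape hatch is deprecated.
  static boolean useLegacyBehavior(Configuration conf) {
    boolean legacy = conf.getBoolean(LEGACY_KEY, false);
    if (legacy) {
      System.err.println("WARN: " + LEGACY_KEY
          + " is deprecated and will be removed in a future major release");
    }
    return legacy;
  }
}
{code}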
[jira] [Commented] (HADOOP-8212) Improve ActiveStandbyElector's behavior when session expires
[ https://issues.apache.org/jira/browse/HADOOP-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238748#comment-13238748 ] Todd Lipcon commented on HADOOP-8212: - Sure, happy to address post-commit. Sorry for moving quick - trying to get at least an initial implementation of auto failover committed quickly, and we can continue to improve and fix it up. Improve ActiveStandbyElector's behavior when session expires Key: HADOOP-8212 URL: https://issues.apache.org/jira/browse/HADOOP-8212 Project: Hadoop Common Issue Type: Improvement Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.23.3, 0.24.0 Attachments: hadoop-8212.txt, hadoop-8212.txt Currently when the ZK session expires, it results in a fatal error being sent to the application callback. This is not the best behavior -- for example, in the case of HA, if ZK goes down, we would like the current state to be maintained, rather than causing either NN to abort. When the ZK clients are able to reconnect, they should sort out the correct leader based on the normal locking schemes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover
[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238915#comment-13238915 ] Todd Lipcon commented on HADOOP-8217: - My thinking for the solution is the following: - add a parameter to transitionToStandby/transitionToActive which is a {{long logicalTime}} - when the ZKFC acquires the lock znode, it makes a note of the zxid (ZK transaction ID) - when it then asks the old active to go to standby, or asks its own node to go active, it includes the zxid - the NN itself maintains a record of the highest zxid it has heard. If it receives a state transition request with an older zxid, it ignores it. This would solve the race as described, since when ZKFC2 calls NN1.transitionToStandby(), it hands NN1 a higher zxid than ZKFC1 saw. So when ZKFC1 then asks it to go active, the request is denied. There is still potentially some race involving the NNs restarting quickly and forgetting the highest zxid. I'm not sure whether the right solution there is to record the info persistently, or to attach a UUID to each NN startup, and use that to make sure we don't target a newer instance of a NN with an RPC that was meant for an earlier one. Other creative solutions appreciated :) Edge case split-brain race in ZK-based auto-failover Key: HADOOP-8217 URL: https://issues.apache.org/jira/browse/HADOOP-8217 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon As discussed in HADOOP-8206, the current design for automatic failover has the following race: - ZKFC1 gets active lock - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping) - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
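A rough sketch of the receiving side of the zxid idea above; the class and method names are assumptions, not the actual HAServiceProtocol changes:
{code}
import java.io.IOException;

// Sketch only: the NN-side guard that ignores state-transition requests carrying
// an older logical time (zxid) than the newest one already seen.
public class LogicalTimeGuard {
  private long highestSeenLogicalTime = Long.MIN_VALUE;

  public synchronized void checkRequest(long logicalTime) throws IOException {
    if (logicalTime < highestSeenLogicalTime) {
      throw new IOException("Rejecting state transition request: logical time "
          + logicalTime + " is older than " + highestSeenLogicalTime);
    }
    highestSeenLogicalTime = logicalTime;
  }
}
{code}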
[jira] [Commented] (HADOOP-8206) Common portion of ZK-based failover controller
[ https://issues.apache.org/jira/browse/HADOOP-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238982#comment-13238982 ] Todd Lipcon commented on HADOOP-8206: - bq. Makes sense to me. One question, though - we seem to be inconsistently using IllegalArgumentException and HadoopIllegalArgumentException. Is there any good reason for that? I'm not entirely sure -- looking across the code as a whole, we have a 10:1 ratio of IllegalArgumentException vs HadoopIllegalArgumentException. So I'm erring on the side of what's used more often, except in a few places where we directly expose it as a potentially user-visible error (like bad command line arguments). Common portion of ZK-based failover controller -- Key: HADOOP-8206 URL: https://issues.apache.org/jira/browse/HADOOP-8206 Project: Hadoop Common Issue Type: New Feature Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8206.txt, hadoop-8206.txt, hadoop-8206.txt This JIRA is for the Common (generic) portion of HDFS-2185. It can't run on its own, but this JIRA will include unit tests using mock/dummy services. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239120#comment-13239120 ] Todd Lipcon commented on HADOOP-8202: - This broke TestZKFailoverController, since it's now getting IllegalArgumentException trying to close the proxy. stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202-3.patch, HADOOP-8202-4.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239121#comment-13239121 ] Todd Lipcon commented on HADOOP-8218: - I see three options: 1) Anywhere we call RPC.closeProxy, we check if (foo instanceof Closeable) { ... } first. But, that defeats the whole purpose of throwing the exception when we pass non-closeables, so we might as well just revert the behavior back to the original rather than do this. 2) In RPC.closeProxy, if the object doesn't implement Closeable, check if the proxy is a mock object. We can do this by looking for the string EnhancerByMockitoWithCGLIB in the class name. If we see that, pass through. 3) Anywhere we mock out an IPC protocol, we could use the syntax {{mock(FooProtocol.class, withSettings().extraInterfaces(Closeable.class));}}. I am not a fan of this, since it leaks the issue out to all of the test code, rather than localizing the workaround in the one place that matters. Plus, newer users of the mock framework won't know this advanced usage syntax (I had to google for a while to figure it out). So, I plan to implement #2. RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8218) RPC.closeProxy shouldn't throw error when closing a mock
[ https://issues.apache.org/jira/browse/HADOOP-8218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239171#comment-13239171 ] Todd Lipcon commented on HADOOP-8218: - bq. Todd, can the mock object implement a new interface that extends HAServiceProtocol and Closeable? Will that solve the problem? That avoids the advanced syntax, but requires that you make such a fake interface everywhere you mock a protocol, which again is somewhat counter-intuitive. bq. #2 makes the main code aware of test specifics, which isn't a good idea. How about doing #3 by creating a helper method that encapsulates that code in one place? I was thinking about doing that... ie a MockitoUtils.mockIpcProtocol(FooProtocol.class). Since it seems people like this idea better than #2, I'll prepare such a patch. RPC.closeProxy shouldn't throw error when closing a mock Key: HADOOP-8218 URL: https://issues.apache.org/jira/browse/HADOOP-8218 Project: Hadoop Common Issue Type: Bug Components: ipc, test Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8218.txt HADOOP-8202 changed the behavior of RPC.stopProxy() to throw an exception if called on an object which doesn't implement Closeable. Unfortunately, we use mock objects in many test cases, and those mocks don't implement Closeable. This is causing TestZKFailoverController to fail in trunk, for example. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
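A sketch of the helper being proposed; the class and method names follow the comment's suggestion and may differ from what was ultimately committed:
{code}
import java.io.Closeable;
import org.mockito.Mockito;

public abstract class MockitoUtil {
  /**
   * Create a mock of an IPC protocol that also implements Closeable,
   * so RPC.stopProxy() can close it like a real proxy.
   */
  public static <T> T mockIpcProtocol(Class<T> clazz) {
    return Mockito.mock(clazz,
        Mockito.withSettings().extraInterfaces(Closeable.class));
  }
}
{code}
This keeps the extraInterfaces trick in one place, so test writers just call the helper instead of remembering the advanced Mockito syntax.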
[jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
[ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239184#comment-13239184 ] Todd Lipcon commented on HADOOP-8220: - Tests failing due to HADOOP-8218 ZKFailoverController doesn't handle failure to become active correctly -- Key: HADOOP-8220 URL: https://issues.apache.org/jira/browse/HADOOP-8220 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Critical Attachments: hadoop-8220.txt The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8207) createproxy() in TestHealthMonitor is throwing NPE
[ https://issues.apache.org/jira/browse/HADOOP-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237788#comment-13237788 ] Todd Lipcon commented on HADOOP-8207: - I think this is dup of HADOOP-8204 createproxy() in TestHealthMonitor is throwing NPE -- Key: HADOOP-8207 URL: https://issues.apache.org/jira/browse/HADOOP-8207 Project: Hadoop Common Issue Type: Bug Components: ha Affects Versions: 0.24.0 Reporter: Hari Mankude Priority: Minor Looking at the test log output, createproxy in testhealthmonitor is triggering NPE resulting null proxy. This creates other test failures. 2012-03-24 22:16:11,591 FATAL ha.HealthMonitor (HealthMonitor.java:uncaughtException(268)) - Health monitor failed java.lang.NullPointerException at org.apache.hadoop.ha.TestHealthMonitor$1.createProxy(TestHealthMonitor.java:75) at org.apache.hadoop.ha.HealthMonitor.tryConnect(HealthMonitor.java:171) at org.apache.hadoop.ha.HealthMonitor.loopUntilConnected(HealthMonitor.java:158) at org.apache.hadoop.ha.HealthMonitor.access$500(HealthMonitor.java:52) at org.apache.hadoop.ha.HealthMonitor$MonitorDaemon.run(HealthMonitor.java:278) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8208) Disallow self failover
[ https://issues.apache.org/jira/browse/HADOOP-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237979#comment-13237979 ] Todd Lipcon commented on HADOOP-8208: - Looks good. I just reverted HADOOP-8193 since it caused some other test failures, but when it is recommitted we can commit this. +1 in advance Disallow self failover -- Key: HADOOP-8208 URL: https://issues.apache.org/jira/browse/HADOOP-8208 Project: Hadoop Common Issue Type: Bug Components: ha Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8208.txt, hdfs-3145.txt It is currently possible for users to make a standby NameNode failover to itself and become active. We shouldn't allow this to happen in case operators mistype and miss the fact that there are now two active NNs. {noformat} bash-4.1$ hdfs haadmin -ns ha-nn-uri -failover nn2 nn2 Failover from nn2 to nn2 successful {noformat} After the failover above, nn2 will be active. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8202) stopproxy() is not closing the proxies correctly
[ https://issues.apache.org/jira/browse/HADOOP-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237720#comment-13237720 ] Todd Lipcon commented on HADOOP-8202: - I think maintaining the ability to support mockito spies/mocks for proxies is important. We use it for simulating all kinds of failure conditions -- I'm surprised you didn't have a lot of HDFS failures from the same issue. stopproxy() is not closing the proxies correctly Key: HADOOP-8202 URL: https://issues.apache.org/jira/browse/HADOOP-8202 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Hari Mankude Assignee: Hari Mankude Priority: Minor Attachments: HADOOP-8202-1.patch, HADOOP-8202-2.patch, HADOOP-8202.patch, HADOOP-8202.patch I was running testbackupnode and noticed that NNprotocol proxy was not being closed. Talked with Suresh and he observed that most of the protocols do not implement ProtocolTranslator and hence the logic in stopproxy() does not work. Instead, since all of them are closeable, Suresh suggested that closeable property should be used at close. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236934#comment-13236934 ] Todd Lipcon commented on HADOOP-8163: - Hi Bikas. I think your ideas have some merit, especially with regard to a fully general election framework. But since we only have one user of this framework at this point (HDFS) and we currently only support a single standby node, I would prefer to punt these changes to another JIRA as additional improvements. This will let us move forward with the high priority task of auto failover for HA NNs, rather than getting distracted making this extremely general. bq. Secondly, we are performing blocking calls on the ZKClient callback that happens on the ZK threads. It is advisable to not block ZK client threads for long. This is only the case if you have other operations that are waiting on timely delivery of callbacks. In the case of the election framework, all of our notifications from ZK have to be received in-order and processed sequentially, or else we have a huge explosion of possible interactions to worry about. Doing blocking calls in the callbacks will _not_ result in lost ZK leases, etc. To quote from the ZK programmer's guide: {quote}All IO happens on the IO thread (using Java NIO). All event callbacks happen on the event thread. Session maintenance such as reconnecting to ZooKeeper servers and maintaining heartbeat is done on the IO thread. Responses for synchronous methods are also processed in the IO thread. All responses to asynchronous methods and watch events are processed on the event thread... Callbacks do not block the processing of the IO thread or the processing of the synchronous calls{quote} bq. Thirdly, how about using the setData(breadcrumb, appData, version)? Let me see about making this change. Like you said, it's a good safety check. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8060) Add a capability to use of consistent checksums for append and copy
[ https://issues.apache.org/jira/browse/HADOOP-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236965#comment-13236965 ] Todd Lipcon commented on HADOOP-8060: - Doing shallow conf comparison as part of the FS key seems a bit dangerous -- I'm guessing we'll end up with a lot of leakage issues in long running daemons like the NM/RM. Anyone else have some other ideas how to deal with this? I don't think the CreateFlag idea is bad -- maybe better than futzing with the cache. Add a capability to use of consistent checksums for append and copy --- Key: HADOOP-8060 URL: https://issues.apache.org/jira/browse/HADOOP-8060 Project: Hadoop Common Issue Type: Bug Components: fs, util Affects Versions: 0.23.0, 0.23.1, 0.24.0 Reporter: Kihwal Lee Assignee: Kihwal Lee Fix For: 0.23.2, 0.24.0 After the improved CRC32C checksum feature became default, some of use cases involving data movement are no longer supported. For example, when running DistCp to copy from a file stored with the CRC32 checksum to a new cluster with the CRC32C set to default checksum, the final data integrity check fails because of mismatch in checksums. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236970#comment-13236970 ] Todd Lipcon commented on HADOOP-8163: - bq. In my experience API's once made are hard to change. It would be hard for someone to change the control flow later once important services like NN HA depend on the current flow. So punting it for the future would be quite a distant future indeed Given this is an internal API, there shouldn't be any resistance to changing it in the future. It's marked Private/Evolving, meaning that there aren't guarantees of compatibility to external consumers, and that even for internal consumers it's likely to change as use cases evolve. I'll file a follow-up JIRA to consider your recommended API changes, OK? Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8193) Refactor FailoverController/HAAdmin code to add an abstract class for target services
[ https://issues.apache.org/jira/browse/HADOOP-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236984#comment-13236984 ] Todd Lipcon commented on HADOOP-8193: - Also ran findbugs on common and HDFS, there were no additional warnings. Refactor FailoverController/HAAdmin code to add an abstract class for target services --- Key: HADOOP-8193 URL: https://issues.apache.org/jira/browse/HADOOP-8193 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8193.txt, hadoop-8193.txt In working at HADOOP-8077, HDFS-3084, and HDFS-3072, I ran into various difficulties which are an artifact of the current design. A few of these: - the service name is resolved from the logical name (eg ns1.nn1) to an IP address at the outer layer of DFSHAAdmin -- this means it's difficult to provide the logical name ns1.nn1 to fence scripts (HDFS-3084) -- this means it's difficult to configure fencing method per-namespace (since the FailoverController doesn't know what the namespace is) (HADOOP-8077) - the configuration for HA HDFS is weirdly split between core-site and hdfs-site, even though most users see this as an HDFS feature. For example, users expect to configure NN fencing configurations in hdfs-site, and expect the keys to have a dfs.* prefix - proxies are constructed at the outer layer of the admin commands. This means it's impossible for the inner layers (eg FailoverController.failover) to re-construct proxies with different timeouts (HDFS-3072) The proposed refactor is to add a new interface (tentatively named HAServiceTarget) which refers to target for one of the admin commands. An instance of this class is responsible for creating proxies, creating fencers, mapping back to a logical name, etc. The HDFS implementation of this class can then provide different results based on the particular nameservice, can use HDFS-specific configuration prefixes, etc. Using this class as the argument for fencing methods also makes the API more evolvable in the future, since we can add new getters to HAServiceTarget (whereas the current InetSocketAddress is quite limiting) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235937#comment-13235937 ] Todd Lipcon commented on HADOOP-8163: - Hi Bikas. To be clear, I did not remove any of your test cases. I just cleaned it up to be implemented much more simply. It looked like you had some confusion about the semantics of inner classes, etc -- eg using static variables where unnecessary, etc (iirc you are new to Java, so perfectly understandable!). All of the same corner cases you tested are still tested, just with fewer lines of code and fitting our normal coding conventions. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8060) Add a capability to use of consistent checksums for append and copy
[ https://issues.apache.org/jira/browse/HADOOP-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235984#comment-13235984 ] Todd Lipcon commented on HADOOP-8060: - Hi Kihwal. What about making the checksum type part of the FileSystem cache key (like we do for UGI?) It seems like we would have similar problems with configurable timeouts, etc. Add a capability to use of consistent checksums for append and copy --- Key: HADOOP-8060 URL: https://issues.apache.org/jira/browse/HADOOP-8060 Project: Hadoop Common Issue Type: Bug Components: fs, util Affects Versions: 0.23.0, 0.24.0, 0.23.1 Reporter: Kihwal Lee Assignee: Kihwal Lee Fix For: 0.24.0, 0.23.2 After the improved CRC32C checksum feature became default, some of use cases involving data movement are no longer supported. For example, when running DistCp to copy from a file stored with the CRC32 checksum to a new cluster with the CRC32C set to default checksum, the final data integrity check fails because of mismatch in checksums. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
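A hypothetical illustration of that cache-key idea; the field names are invented and this is not the actual FileSystem.Cache.Key implementation:
{code}
// Invented names; illustrates folding the checksum type into the cache key's
// equals/hashCode so FileSystem instances configured with different checksum
// types don't collide in the cache. Fields are assumed non-null for brevity.
final class FsCacheKeySketch {
  final String scheme;
  final String authority;
  final String ugi;
  final String checksumType; // e.g. "CRC32" vs "CRC32C"

  FsCacheKeySketch(String scheme, String authority, String ugi, String checksumType) {
    this.scheme = scheme;
    this.authority = authority;
    this.ugi = ugi;
    this.checksumType = checksumType;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof FsCacheKeySketch)) {
      return false;
    }
    FsCacheKeySketch k = (FsCacheKeySketch) o;
    return scheme.equals(k.scheme) && authority.equals(k.authority)
        && ugi.equals(k.ugi) && checksumType.equals(k.checksumType);
  }

  @Override
  public int hashCode() {
    return ((scheme.hashCode() * 31 + authority.hashCode()) * 31
        + ugi.hashCode()) * 31 + checksumType.hashCode();
  }
}
{code}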
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236116#comment-13236116 ] Todd Lipcon commented on HADOOP-8163: - bq. Am I missing something, or are ensureBaseZNode and baseNodeExists only called by the tests? If so, we should probably relocate them, or at least mark them @VisibleForTesting if they can't be moved for some reason. These are used by my forthcoming patch for the ZK-based automatic failover controller. The ZKFC has a -formatZK flag which calls through to ensureBaseZNode. Once this gets committed I'll move forward uploading the patch there. I fixed the other three of ATM's comments. I'll wait til tomorrow to commit this in case Bikas has any additional feedback. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236213#comment-13236213 ] Todd Lipcon commented on HADOOP-8163: - bq. zkMostRecentFilePath is open to being misunderstood. Same for MOST_RECENT_FILENAME. Done bq. actually MostRecent seems to be a misnomer to me. I think it actually is LockOwnerInfo/LeaderInfo. zkLockOwnerInfoPath/tryDeleteLeaderInfo etc. It's not always the lock owner, though. Basically, we go through the following states: ||Time step||Lock node||MostRecentActive||Description|| |1|-|-|Startup| |2|Node A|-|Node A acquires active lock| |3|Node A|Node A|..and writes its own info| |4|-|Node A|A loses its ZK lease| |5|Node B|Node A|Node B acquires active lock| |6|Node B|-|Node B fences node A| |7|Node B|Node B|Node B writes its info| So, in steps 3 and 7, calling it LeaderInfo or LockOwnerInfo makes sense. But in steps 4 and 5, it's the PreviousLeaderInfo. Perhaps just renaming to LeaderBreadcrumb or something makes more sense, since it's basically a bread crumb left around by the previous leader so that future leaders know its info. bq. why is ensureBaseNode() needed? In it we are creating a new set of znodes with the given zkAcls which may or may not be the correct thing. eg. if the admin simply forgot to create the appropriate znode path before starting the service it might be ok to fail. Instead of trying to create the path ourselves with permissions that may or may not be appropriate for the entire path. I would be wary of doing this. What is the use case? The use case is a ZKFailoverController -formatZK command line tool that I'm building into the ZKFC code. The thinking is that most administrators won't want to go into the ZK CLI to manually create the parent znode while installing HDFS. Instead, they'd rather just issue this simple command. In the case that they want to have varying permissions across the path, or some more complicated ACL, then they'll have to use the ZK CLI, but for the common case I think this will make deployment much simpler. bq. consider renaming baseNodeExists() to parentNodeExists() or renaming the parentZnodeName parameter in the constructor to baseNode for consistency. Perhaps this could be called in the constructor to check that the znode exists and be done with config issues. No need for ensureBaseNode() above. Renamed to parentZNodeExists and ensureParentZNode bq. this must be my newbie java skills but I find something like - prefixPath.append("/").append(pathParts[index]) or znodeWorkingDir.subString(0, znodeWorkingDir.nextIndexOf('/')) - more readable than prefixPath = Joiner.on("/").join(Arrays.asList(pathParts).subList(0, i)). It might also be more efficient but thats not relevant for this situation. Agreed, fixed. bq. public synchronized void quitElection(boolean needFence) - Dont we want to delete the permanent znode for standby's too? Why check if state is active. It anyways calls a tryDelete* method that should be harmless. If the node is standby, then the permanent znode refers to the current lockholder. So deleting it would incorrectly signify that whoever is active doesn't need to be fenced if it crashes. bq. tryDeleteMostRecentNode() - From my understanding of tryFunction - this function should be not really be asserting that some state holds. If it should assert then we should remove try from the name. The difference here is this: the assert() guards against programmer error.
It is a mistake to call this function when you aren't active (see above comment). But if there is a ZK error (like the session got lost) it's OK to fail to delete it, since it just means that the node will get fenced. bq. in zkDoWithRetries there is a NUM_RETRIES field that could be used instead of 3. Fixed bq. why are we exposing public synchronized ZooKeeper getZKClient()? Removed bq. the following code seems to have issues... snip... While that is happening, the state of the world changes and this elector is not longer the lock owner. When appClient.fenceOldActive(data) will complete then the code will go ahead and delete the lockOwnerZnode at zkMostRecentFilePath. This node could be from the new leader who had successfully fenced and become active. The version number parameter might accidentally save us but would likely be 0 all the time. This scenario is impossible for the following reason: If the state of the world changed and this node was no longer active, the only possible reason for that is that the node lost its ZK session lease. If that's the case, then it won't be able to issue any further commands from that client (see my conversation with Hari above) bq. what happens if the leader lost the lock, tried to delete its znode, failed to do so, exited anyways, then became the next owner and found the existing mostrecent znode. I think it will try to fence itself
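As a side note on the zkDoWithRetries point above, a rough sketch of that retry pattern; the Action interface and the set of retryable codes are assumptions, not the actual elector code:
{code}
import org.apache.zookeeper.KeeperException;

// Sketch only: retry a ZK operation a bounded number of times on transient errors.
public class ZkRetrySketch {
  private static final int NUM_RETRIES = 3;

  interface Action<T> {
    T run() throws KeeperException, InterruptedException;
  }

  static <T> T zkDoWithRetries(Action<T> action)
      throws KeeperException, InterruptedException {
    int attempt = 0;
    while (true) {
      try {
        return action.run();
      } catch (KeeperException ke) {
        // Retry only transient connection-level failures, up to NUM_RETRIES attempts.
        if (isTransient(ke.code()) && ++attempt < NUM_RETRIES) {
          continue;
        }
        throw ke;
      }
    }
  }

  private static boolean isTransient(KeeperException.Code code) {
    return code == KeeperException.Code.CONNECTIONLOSS
        || code == KeeperException.Code.OPERATIONTIMEOUT;
  }
}
{code}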
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236276#comment-13236276 ] Todd Lipcon commented on HADOOP-8163: - bq. So paranoid admin deletes the lock hoping a new master might solve this If the admin is mucking about in ZK, then all bets are off. The proper thing for the admin to do is to kill B's failover controller, not to go delete a znode. bq. Yes. I am suggesting to do this within the Elector and not at the ZKFailoverController level. The self compare approach would be reasonable as long as we can assure ourselves that appData will not be same across different candidates K, that's the approach in the latest patch I uploaded. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.23.3, 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8157) TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE
[ https://issues.apache.org/jira/browse/HADOOP-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235019#comment-13235019 ] Todd Lipcon commented on HADOOP-8157: - I think I understand this bug. It's probably due to an error in HADOOP-6502. Patch and explanation en route. TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE - Key: HADOOP-8157 URL: https://issues.apache.org/jira/browse/HADOOP-8157 Project: Hadoop Common Issue Type: Test Affects Versions: 0.24.0 Reporter: Eli Collins Assignee: Todd Lipcon Saw TestRPCCallBenchmark#testBenchmarkWithWritable fail with the following on jenkins: Caused by: java.lang.RuntimeException: IPC server unable to read call parameters: readObject can't find class java.lang.String -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233638#comment-13233638 ] Todd Lipcon commented on HADOOP-8163: - Hi Hari. I like your ideas about using this info znode for failover/restart preferences. But I don't think it's a requirement for a first draft, and it's not clear what you mean by 'state equalization' in your second point. We don't currently use this terminology. Are you OK with the current design for a first draft? We can add improvements later -- I'm using a protobuf for the info in ZK so we can evolve the information contained within without breaking compatibility. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8183) Stop using mapred.used.genericoptionsparser to avoid unnecessary warnings
[ https://issues.apache.org/jira/browse/HADOOP-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232777#comment-13232777 ] Todd Lipcon commented on HADOOP-8183: - +1 Stop using mapred.used.genericoptionsparser to avoid unnecessary warnings --- Key: HADOOP-8183 URL: https://issues.apache.org/jira/browse/HADOOP-8183 Project: Hadoop Common Issue Type: Improvement Components: util Affects Versions: 0.23.0 Reporter: Harsh J Assignee: Harsh J Priority: Minor Attachments: HADOOP-8183.patch Its about time we stopped the following from appearing in 0.23/trunk: {code} 12/03/19 20:53:51 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8151) Error handling in snappy decompressor throws invalid exceptions
[ https://issues.apache.org/jira/browse/HADOOP-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232225#comment-13232225 ] Todd Lipcon commented on HADOOP-8151: - +1, patch looks good to me. Please upload a trunk patch as well. Error handling in snappy decompressor throws invalid exceptions --- Key: HADOOP-8151 URL: https://issues.apache.org/jira/browse/HADOOP-8151 Project: Hadoop Common Issue Type: Bug Components: io, native Affects Versions: 0.24.0, 1.0.2 Reporter: Todd Lipcon Assignee: Matt Foley Attachments: HADOOP-8151-branch-1.0.patch SnappyDecompressor.c has the following code in a few places: {code} THROW(env, "Ljava/lang/InternalError", "Could not decompress data. Buffer length is too small."); {code} This is incorrect, though, since the THROW macro doesn't need the L before the class name. This results in a ClassNotFoundException for Ljava.lang.InternalError being thrown, instead of the intended exception. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13229589#comment-13229589 ] Todd Lipcon commented on HADOOP-8163: - bq. The question I had was how is the info znode creation prevented when the client does not have the ephemeral lock znode? Is this ensured in the zk client or at the zookeeper? This is ensured by ZooKeeper. The only reason the ephemeral node would disappear is if the session was expired. This means the leader has marked the session as such -- and thus, you can no longer issue commands under that same session. To be sure, I just double checked with Pat Hunt from the ZK team. Apparently there was a rare race condition bug ZOOKEEPER-1208 fixed in 3.3.4/3.4.0 about this exact case: https://issues.apache.org/jira/browse/ZOOKEEPER-1208?focusedCommentId=13149787page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13149787 ... but since Hadoop will probably need the krb5 auth from ZK 3.4, it seems a reasonable requirement to need at least that version. Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8163.txt When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8163) Improve ActiveStandbyElector to provide hooks for fencing old active
[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228043#comment-13228043 ] Todd Lipcon commented on HADOOP-8163: - The design here is pretty simple: *In ZK*: - add an additional znode (the info znode) next to the lock znode, which is a PERSISTENT node with the same data. *Upon successfully acquiring the lock znode:* - check if there exists an info znode -- if so, the previous active did not exit cleanly. Call an application-provided fencing hook, providing the data from the info znode -- If the fencing hook succeeds, delete the info znode - create an info znode with one's own app data - proceed to call the {{becomeActive}} API on the app *Upon crashing:* - the ephemeral node disappears - by the order of events above, if the application has become active, then it will have created an info znode so whoever recovers knows to fence it *Upon graceful exit:* - first transition out of active mode (e.g. shutdown the NN) - then delete the info node - then close the session (deleting the ephemeral node) Improve ActiveStandbyElector to provide hooks for fencing old active Key: HADOOP-8163 URL: https://issues.apache.org/jira/browse/HADOOP-8163 Project: Hadoop Common Issue Type: Improvement Components: ha Affects Versions: 0.24.0, 0.23.3 Reporter: Todd Lipcon Assignee: Todd Lipcon When a new node becomes active in an HA setup, it may sometimes have to take fencing actions against the node that was formerly active. This JIRA extends the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK directory, which acts as a second copy of the active node's information. Then, if the active loses its ZK session, the next active to be elected may easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
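A minimal sketch of the acquire-then-fence sequence above, assuming a hypothetical FencingHook interface and a fixed OPEN_ACL_UNSAFE ACL; the real elector uses configurable ACLs and application-provided data:
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: illustrates the order of operations, not the real ActiveStandbyElector.
public class InfoNodeFencer {
  interface FencingHook {
    void fenceOldActive(byte[] oldActiveData) throws Exception;
  }

  private final ZooKeeper zk;
  private final String infoNodePath;
  private final FencingHook hook;

  InfoNodeFencer(ZooKeeper zk, String infoNodePath, FencingHook hook) {
    this.zk = zk;
    this.infoNodePath = infoNodePath;
    this.hook = hook;
  }

  // Called after this node has successfully acquired the ephemeral lock znode.
  void onLockAcquired(byte[] myData) throws Exception {
    try {
      byte[] oldData = zk.getData(infoNodePath, false, null);
      // Info znode still present: the previous active did not exit cleanly, so fence it.
      hook.fenceOldActive(oldData);
      zk.delete(infoNodePath, -1);
    } catch (KeeperException.NoNodeException e) {
      // No info znode: the previous active exited cleanly, nothing to fence.
    }
    // Leave our own breadcrumb before telling the app to become active.
    zk.create(infoNodePath, myData, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }
}
{code}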
[jira] [Commented] (HADOOP-7788) HA: Simple HealthMonitor class to watch an HAService
[ https://issues.apache.org/jira/browse/HADOOP-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228100#comment-13228100 ] Todd Lipcon commented on HADOOP-7788: - Oh, sorry, I also left in the main() method. Though the test covers the code fairly well, having a main() method is helpful for manual testing of some things like kill -STOPping the monitored process and making sure timeouts are handled correctly, etc. That's hard to mock out. HA: Simple HealthMonitor class to watch an HAService Key: HADOOP-7788 URL: https://issues.apache.org/jira/browse/HADOOP-7788 Project: Hadoop Common Issue Type: New Feature Components: ha Affects Versions: 0.24.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-7788.txt, hdfs-2524.txt This is a utility class which will be part of the FailoverController. The class starts a daemon thread which periodically monitors an HAService, calling its monitorHealth function. It then generates callbacks into another class when the health status changes (eg the RPC fails or the service returns a HealthCheckFailedException) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8154) DNS#getIPs shouldn't silently return the local host IP for bogus interface names
[ https://issues.apache.org/jira/browse/HADOOP-8154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225587#comment-13225587 ] Todd Lipcon commented on HADOOP-8154: - Under what circumstances would the following code trigger? {code}
+    } catch (SocketException e) {
+      LOG.warn("I/O error finding interface " + strInterface +
+          ": " + e.getMessage());
{code} Seems strange that we fall back to the default there, but throw an exception if we specify an invalid one. DNS#getIPs shouldn't silently return the local host IP for bogus interface names Key: HADOOP-8154 URL: https://issues.apache.org/jira/browse/HADOOP-8154 Project: Hadoop Common Issue Type: Bug Components: conf Reporter: Eli Collins Assignee: Eli Collins Attachments: hadoop-8154.txt DNS#getIPs silently returns the local host IP for bogus interface names. In this case let's throw an UnknownHostException. This is technically an incompatible change. I suspect the current behavior was originally introduced so the interface name default works w/o explicitly checking for it. It may also be used in cases where someone is using a shared config file and an option like dfs.datanode.dns.interface or hbase.master.dns.interface and e.g. interface eth3 that some hosts don't have, though I think silently ignoring this is the wrong behavior (those hosts should be configured to use a different interface). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
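A minimal sketch of the proposed behavior (fail loudly instead of silently falling back to the local host IP), using only the JDK NetworkInterface API; the class and method shape here are illustrative and not the actual patch.
{code}
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only, not the HADOOP-8154 patch.
class InterfaceLookup {
  // Return the addresses of the named interface, or fail loudly if it doesn't exist.
  static List<InetAddress> getIPs(String strInterface)
      throws UnknownHostException, SocketException {
    if ("default".equals(strInterface)) {
      // The "default" name still needs an explicit check once the fallback is gone.
      return Collections.singletonList(InetAddress.getLocalHost());
    }
    NetworkInterface netIf = NetworkInterface.getByName(strInterface);
    if (netIf == null) {
      // Previously this fell back to the local host IP; throw instead.
      throw new UnknownHostException("No such interface " + strInterface);
    }
    return Collections.list(netIf.getInetAddresses());
  }
}
{code}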
[jira] [Commented] (HADOOP-7806) [DNS] Support binding to sub-interfaces
[ https://issues.apache.org/jira/browse/HADOOP-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225768#comment-13225768 ] Todd Lipcon commented on HADOOP-7806: - +1 pending results on new patch [DNS] Support binding to sub-interfaces --- Key: HADOOP-7806 URL: https://issues.apache.org/jira/browse/HADOOP-7806 Project: Hadoop Common Issue Type: New Feature Components: util Affects Versions: 0.24.0 Reporter: Harsh J Assignee: Harsh J Fix For: 0.24.0 Attachments: HADOOP-7806.patch, HADOOP-7806.patch, hadoop-7806.txt Right now, with the {{DNS}} class, we can look up IPs of provided interface names ({{eth0}}, {{vm1}}, etc.). However, it would be useful if the I/F -> IP lookup also took a look at subinterfaces ({{eth0:1}}, etc.) and allowed binding to only a specified subinterface / virtual interface. This should be fairly easy to add, by matching against all available interfaces' subinterfaces via Java. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
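As a rough illustration of the matching described above (not the actual HADOOP-7806 patch), the JDK already exposes sub-interfaces via NetworkInterface#getSubInterfaces():
{code}
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

// Illustrative sketch: find an interface or sub-interface (e.g. "eth0:1") by name.
class SubInterfaceLookup {
  static NetworkInterface find(String name) throws SocketException {
    for (NetworkInterface nif : Collections.list(NetworkInterface.getNetworkInterfaces())) {
      if (nif.getName().equals(name)) {
        return nif;
      }
      // Sub-interfaces are not returned by getNetworkInterfaces(), so check them explicitly.
      for (NetworkInterface sub : Collections.list(nif.getSubInterfaces())) {
        if (sub.getName().equals(name)) {
          return sub;
        }
      }
    }
    return null;  // caller decides whether to throw for an unknown name
  }
}
{code}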
[jira] [Commented] (HADOOP-8157) TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE
[ https://issues.apache.org/jira/browse/HADOOP-8157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13225873#comment-13225873 ] Todd Lipcon commented on HADOOP-8157: - This failure is super-goofy. My hunch is it's something to do with non-threadsafe use of classloaders or some other bad synchronization, but I don't have much to go on. Any ideas? TestRPCCallBenchmark#testBenchmarkWithWritable fails with RTE - Key: HADOOP-8157 URL: https://issues.apache.org/jira/browse/HADOOP-8157 Project: Hadoop Common Issue Type: Test Affects Versions: 0.24.0 Reporter: Eli Collins Saw TestRPCCallBenchmark#testBenchmarkWithWritable fail with the following on jenkins: Caused by: java.lang.RuntimeException: IPC server unable to read call parameters: readObject can't find class java.lang.String -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8151) Error handling in snappy decompressor throws invalid exceptions
[ https://issues.apache.org/jira/browse/HADOOP-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13224848#comment-13224848 ] Todd Lipcon commented on HADOOP-8151: - This bug seems to occur in lz4 as well. It also seems like the wrong kind of exception to throw - InternalError is for JVM-internal unexpected conditions. Error handling in snappy decompressor throws invalid exceptions --- Key: HADOOP-8151 URL: https://issues.apache.org/jira/browse/HADOOP-8151 Project: Hadoop Common Issue Type: Bug Components: io, native Affects Versions: 0.24.0, 1.0.2 Reporter: Todd Lipcon SnappyDecompressor.c has the following code in a few places: {code}
THROW(env, "Ljava/lang/InternalError", "Could not decompress data. Buffer length is too small.");
{code} This is incorrect, though, since the THROW macro doesn't need the L before the class name. This results in a ClassNotFoundException for Ljava.lang.InternalError being thrown, instead of the intended exception. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8153) Fail to submit mapred job on a secured-HA-HDFS: logical URI cannot be picked up by job submission.
[ https://issues.apache.org/jira/browse/HADOOP-8153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13224908#comment-13224908 ] Todd Lipcon commented on HADOOP-8153: - Looks like we need to override FileSystem.getCanonicalServiceName in DistributedFileSystem so that the canonical name is just the logical name, for the case of HA HDFS file systems. Fail to submit mapred job on a secured-HA-HDFS: logical URI cannot be picked up by job submission. Key: HADOOP-8153 URL: https://issues.apache.org/jira/browse/HADOOP-8153 Project: Hadoop Common Issue Type: Bug Components: ha, security Affects Versions: 0.24.0 Reporter: Mingjie Lai Fix For: 0.24.0 When testing the combination of NN HA + security + yarn, I found that the mapred job submission cannot pick up the logical URI of a nameservice. I have the logical URI configured in core-site.xml: {code}
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ns1</value>
</property>
{code} The HDFS client can work with the HA deployment/configs: {code}
[root@nn1 hadoop]# hdfs dfs -ls /
Found 6 items
drwxr-xr-x - hbase hadoop 0 2012-03-07 20:42 /hbase
drwxrwxrwx - yarn hadoop 0 2012-03-07 20:42 /logs
drwxr-xr-x - mapred hadoop 0 2012-03-07 20:42 /mapred
drwxr-xr-x - mapred hadoop 0 2012-03-07 20:42 /mr-history
drwxrwxrwt - hdfs hadoop 0 2012-03-07 21:57 /tmp
drwxr-xr-x - hdfs hadoop 0 2012-03-07 20:42 /user
{code} but cannot submit a mapred job with security turned on: {code}
[root@nn1 hadoop]# /usr/lib/hadoop/bin/yarn --config ./conf jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar randomwriter out
Running 0 maps.
Job started: Wed Mar 07 23:28:23 UTC 2012
java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:431)
    at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:312)
    at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:217)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:119)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:411)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:326)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1221)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
{code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
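A sketch of what Todd's suggestion might look like; the HAUtil.isLogicalUri() helper is assumed here for illustration, and this is not necessarily the fix that was committed.
{code}
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HAUtil;

// Hypothetical sketch, not the committed patch: for a logical HA URI such as
// hdfs://ns1, return the logical name itself instead of a resolved host:port,
// so TokenCache can look up the delegation token by "ns1".
public class LogicalNameDistributedFileSystem extends DistributedFileSystem {
  @Override
  public String getCanonicalServiceName() {
    if (HAUtil.isLogicalUri(getConf(), getUri())) {  // assumed helper
      return getUri().getHost();                     // e.g. "ns1"
    }
    return super.getCanonicalServiceName();          // normal host:port service name
  }
}
{code}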
[jira] [Commented] (HADOOP-8135) Add ByteBufferReadable interface to FSDataInputStream
[ https://issues.apache.org/jira/browse/HADOOP-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221334#comment-13221334 ] Todd Lipcon commented on HADOOP-8135: - {code}
+ * @return - the number of bytes available to read from buf
{code} style nit: no '-' here. Also, it's probably worth noting in the javadoc that many FS implementations may throw UnsupportedOperationException. Add ByteBufferReadable interface to FSDataInputStream - Key: HADOOP-8135 URL: https://issues.apache.org/jira/browse/HADOOP-8135 Project: Hadoop Common Issue Type: New Feature Components: fs Reporter: Henry Robinson Assignee: Henry Robinson Attachments: HADOOP-8135.patch To prepare for HDFS-2834, it's useful to add an interface to FSDataInputStream (and others inside hdfs) that adds a read(ByteBuffer...) method as follows: {code}
/**
 * Reads up to buf.remaining() bytes into buf. Callers should use
 * buf.limit(..) to control the size of the desired read.
 *
 * After the call, buf.position() should be unchanged, and therefore any data
 * can be immediately read from buf.
 *
 * @param buf
 * @return - the number of bytes available to read from buf
 * @throws IOException
 */
public int read(ByteBuffer buf) throws IOException;
{code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
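From the caller's side, usage of the proposed method would look something like the sketch below, assuming a stream that implements the new interface; the helper class is illustrative only.
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative caller-side use of the proposed read(ByteBuffer) API.
class ByteBufferReadExample {
  static ByteBuffer readChunk(FSDataInputStream in, int chunkSize) throws IOException {
    ByteBuffer buf = ByteBuffer.allocateDirect(chunkSize);
    buf.limit(chunkSize);        // per the javadoc, limit() controls the desired read size
    int n = in.read(buf);        // may throw UnsupportedOperationException on some FSes
    if (n < 0) {
      throw new IOException("Unexpected end of stream");
    }
    // Per the proposed contract, buf.position() is unchanged, so the n bytes
    // can be consumed from buf immediately without flipping.
    return buf;
  }
}
{code}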
[jira] [Commented] (HADOOP-8104) Inconsistent Jackson versions
[ https://issues.apache.org/jira/browse/HADOOP-8104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214192#comment-13214192 ] Todd Lipcon commented on HADOOP-8104: - Will this now break HBase or other projects which also use Jersey? HBase appears to use jersey 1.4. Inconsistent Jackson versions - Key: HADOOP-8104 URL: https://issues.apache.org/jira/browse/HADOOP-8104 Project: Hadoop Common Issue Type: Bug Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Attachments: HADOOP-8104.patch This is a Maven build issue. Jersey 1.8 is pulling in version 1.7.1 of Jackson. Meanwhile, we are manually specifying that we want version 1.8 of Jackson in the POM files. This causes a conflict where Jackson produces unexpected results when serializing Map objects. How to reproduce: try this code: {quote}
ObjectMapper mapper = new ObjectMapper();
Map<String, Object> m = new HashMap<String, Object>();
mapper.writeValue(new File("foo"), m);
{quote} You will get an exception: {quote}
Exception in thread "main" java.lang.NoSuchMethodError: org.codehaus.jackson.type.JavaType.isMapLikeType()Z
    at org.codehaus.jackson.map.ser.BasicSerializerFactory.buildContainerSerializer(BasicSerializerFactory.java:396)
    at org.codehaus.jackson.map.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:267)
{quote} Basically the inconsistent versions of various Jackson components are causing this NoSuchMethodError. As far as I know, this only occurs when serializing maps -- that's why it hasn't been found and fixed yet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8097) TestRPCCallBenchmark failing w/ port in use -handling badly
[ https://issues.apache.org/jira/browse/HADOOP-8097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212751#comment-13212751 ] Todd Lipcon commented on HADOOP-8097: - I'm not sure this is the best fix (relying on a different static port). A few other ideas: - change the benchmark so that if a port isn't specified, it binds to port 0, and then has the clients connect to whichever port gets bound - make sure it uses REUSEADDR so that it can still bind despite the TIME_WAIT sockets Either of those make sense? I honestly thought I'd written it to use port 0 but apparently I didn't :) TestRPCCallBenchmark failing w/ port in use -handling badly --- Key: HADOOP-8097 URL: https://issues.apache.org/jira/browse/HADOOP-8097 Project: Hadoop Common Issue Type: Bug Components: ipc Affects Versions: 0.24.0 Reporter: Steve Loughran Assignee: Steve Loughran Priority: Minor Fix For: 0.24.0 Attachments: HADOOP-8097.patch I'm seeing TestRPCCallBenchmark fail with port in use, which is probably related to some other test (race condition on shutdown?), but which isn't being handled that well in the test itself -although the log shows the binding exception, the test is failing on a connection timeout -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
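The first suggestion is straightforward with plain java.net sockets; here is a minimal sketch of the idea (not the benchmark's actual server setup):
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Illustrative only: bind to an ephemeral port and report the port actually chosen.
class EphemeralPortExample {
  public static void main(String[] args) throws IOException {
    ServerSocket server = new ServerSocket();
    server.setReuseAddress(true);                 // tolerate lingering TIME_WAIT sockets
    server.bind(new InetSocketAddress(0));        // port 0 = let the OS pick a free port
    int actualPort = server.getLocalPort();       // clients connect to whatever got bound
    System.out.println("Benchmark server listening on port " + actualPort);
    server.close();
  }
}
{code}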
[jira] [Commented] (HADOOP-8093) HadoopRpcRequestProto should not be serialized twice
[ https://issues.apache.org/jira/browse/HADOOP-8093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13212307#comment-13212307 ] Todd Lipcon commented on HADOOP-8093: - This seems like a dup of HADOOP-8084, but the implementation in 8084 actually avoids one more copy than this. HadoopRpcRequestProto should not be serialized twice --- Key: HADOOP-8093 URL: https://issues.apache.org/jira/browse/HADOOP-8093 Project: Hadoop Common Issue Type: Improvement Components: ipc Affects Versions: 0.24.0, 0.23.2 Environment: Windows 7 Reporter: Changming Sun Attachments: HADOOP-8093.patch Original Estimate: 1m Remaining Estimate: 1m {code}
@Override
public void write(DataOutput out) throws IOException {
  out.writeInt(message.toByteArray().length);
  out.write(message.toByteArray());
}
{code} This code is inefficient: it serializes the message to a byte array twice. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
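To make the inefficiency concrete, a sketch of the minimal fix is to serialize the message once and reuse the byte array; per the comment above, HADOOP-8084 avoids one more copy still. The surrounding class here is invented for illustration.
{code}
import java.io.DataOutput;
import java.io.IOException;
import com.google.protobuf.Message;

// Sketch only: call toByteArray() a single time instead of twice.
class SingleSerializationWriter {
  private final Message message;

  SingleSerializationWriter(Message message) {
    this.message = message;
  }

  public void write(DataOutput out) throws IOException {
    byte[] bytes = message.toByteArray();  // one serialization of the protobuf
    out.writeInt(bytes.length);
    out.write(bytes);
  }
}
{code}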
[jira] [Commented] (HADOOP-8066) The full docs build intermittently fails
[ https://issues.apache.org/jira/browse/HADOOP-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208684#comment-13208684 ] Todd Lipcon commented on HADOOP-8066: - This is a regression, right? Any chance we could revert the commit that introduced it while we figure out the solution? Or introduce a workaround even if it's temporary and slows the build? It's bad to not get the nightly test results anymore. The full docs build intermittently fails Key: HADOOP-8066 URL: https://issues.apache.org/jira/browse/HADOOP-8066 Project: Hadoop Common Issue Type: Bug Components: build Affects Versions: 0.24.0 Reporter: Aaron T. Myers Assignee: Andrew Bayer See for example: https://builds.apache.org/job/Hadoop-Hdfs-trunk/954/ https://builds.apache.org/job/Hadoop-Common-trunk/317/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HADOOP-8069) Enable TCP_NODELAY by default for IPC
[ https://issues.apache.org/jira/browse/HADOOP-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209012#comment-13209012 ] Todd Lipcon commented on HADOOP-8069: - Hi Daryn. Your above descriptions sound right, except the nagle delay on Linux is 40ms rather than 200 (I think the dack delay is 200 though like you said). I hacked up something like my #4 yesterday morning but didn't really like the way I did it so I threw it away. I'll try again soon :) Enable TCP_NODELAY by default for IPC - Key: HADOOP-8069 URL: https://issues.apache.org/jira/browse/HADOOP-8069 Project: Hadoop Common Issue Type: Improvement Components: ipc Affects Versions: 0.23.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8069.txt I think we should switch the default for the IPC client and server NODELAY options to true. As wikipedia says: {quote} In general, since Nagle's algorithm is only a defense against careless applications, it will not benefit a carefully written application that takes proper care of buffering; the algorithm has either no effect, or negative effect on the application. {quote} Since our IPC layer is well contained and does its own buffering, we shouldn't be careless. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
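For context, TCP_NODELAY is the standard per-socket switch for Nagle's algorithm; enabling it in Java is a one-liner, and the proposal is simply to make that the default for the IPC client and server sockets. The helper below is illustrative, not the actual IPC code.
{code}
import java.io.IOException;
import java.net.Socket;

// Illustrative only: what enabling TCP_NODELAY looks like at the socket level.
class NoDelayExample {
  static Socket connect(String host, int port) throws IOException {
    Socket s = new Socket(host, port);
    // Disable Nagle's algorithm: small writes are sent immediately instead of
    // being delayed (~40ms on Linux) while waiting to coalesce with later writes.
    s.setTcpNoDelay(true);
    return s;
  }
}
{code}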
[jira] [Commented] (HADOOP-8069) Enable TCP_NODELAY by default for IPC
[ https://issues.apache.org/jira/browse/HADOOP-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209150#comment-13209150 ] Todd Lipcon commented on HADOOP-8069: - My hunch is that it's pretty small. I think the only RPC to the NN which would be at all frequent and cross the 8K boundary would be getListing(). On one production hbase cluster I collected metrics from a while back, getListing represented 8.3% of the RPCs. On one of our QA clusters that's been running MR workloads, it represents 2.3%. Unfortunately we don't have enough metrics to get any info on the size distribution of those responses. Would be interested to hear if some of your production clusters show a similar mix. Enable TCP_NODELAY by default for IPC - Key: HADOOP-8069 URL: https://issues.apache.org/jira/browse/HADOOP-8069 Project: Hadoop Common Issue Type: Improvement Components: ipc Affects Versions: 0.23.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Attachments: hadoop-8069.txt I think we should switch the default for the IPC client and server NODELAY options to true. As wikipedia says: {quote} In general, since Nagle's algorithm is only a defense against careless applications, it will not benefit a carefully written application that takes proper care of buffering; the algorithm has either no effect, or negative effect on the application. {quote} Since our IPC layer is well contained and does its own buffering, we shouldn't be careless. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira