Re: Looking to a Hadoop 3 release
IMO, if part of the community wants to take on the responsibility and work that it takes to do a new major release, we should not discourage them from doing that. Having multiple major branches active is a standard practice. This time around we are not replacing the guts as we did from Hadoop 1 to Hadoop 2, but doing superficial surgery to address issues that were not considered (or were too much to take on top of the guts transplant). For the split-brain concern, we did a great job maintaining Hadoop 1 and Hadoop 2 until Hadoop 1 faded away. Based on that experience I would say that the coexistence of Hadoop 2 and Hadoop 3 will be much less demanding/traumatic. Also, to facilitate the coexistence we should limit Java language features to Java 7 (even if the runtime is Java 8); once Java 7 is not used anymore we can remove this limitation. Thanks.

On Thu, Mar 5, 2015 at 11:40 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

The 'resistance' is not so much about a new major release, more about the content and the roadmap of the release. Other than the two specific features raised (the need for breaking compat for them is something that I am debating), I haven't seen a roadmap of branch-3 with any more features that this community needs to discuss. If all the difference between branch-2 and branch-3 is going to be the JDK plus a couple of incompatible changes, it is a big problem in two dimensions: (1) it's a burden keeping the branches in sync and avoiding the split-brain we experienced with 1.x/2.x, or worse, branch-0.23/branch-2, and (2) it is very hard to ask people to not break more things in branch-3. We seem to have agreed upon a course of action for JDK7, and now we are taking a different direction for JDK8. Going by this new proposal, come 2016 we will have to deal with JDK9 and three mainline incompatible Hadoop releases.
Regarding individual improvements like classpath isolation and the shell script stuff, Jason Lowe captured it perfectly on HADOOP-11656 - it should be possible for every major feature that we develop to be opt-in, unless the change is so great that users can balance out the incompatibilities against the new stuff they are getting. Even with a ground-breaking change like YARN, we spent a bit of time ensuring compatibility (MAPREDUCE-5108), and that has paid for itself many times over. Breaking compatibility shouldn't come across as too cheap a thing. Thanks, +Vinod

On Mar 4, 2015, at 10:15 AM, Andrew Wang andrew.w...@cloudera.com wrote:

Where does this resistance to a new major release stem from? As I've described from the beginning, this will look basically like a 2.x release, except for the inclusion of classpath isolation by default and a target version of JDK8. I've expressed my desire to maintain API and wire compatibility, and we can audit the set of incompatible changes in trunk to ensure this. My proposal for doing alpha and beta releases leading up to GA also gives downstreams a nice amount of time for testing and validation.
Re: Looking to a Hadoop 3 release
Moving to JDK8 involves a lot of things:

(1) Get Hadoop apps to be able to run on JDK8 and use JDK8 language features. This is already possible with the decoupling of apps from the platform.
(2) Get the platform to run on JDK8. This can be done so that we can run Hadoop on both JDK8 and JDK7 without any compatibility issues. This in itself is a huge move, what with potential GC behavior changes, native library compat, etc.
(3) Get the platform to use JDK8 language features. As much as I love the new stuff in JDK8, I'm willing to postpone usage of the language features in the platform till the time when JDK8 is already in full force.

So, how about we do (1) + (2) for now, get JDK8 going, and then come around to make the decision of dropping support for JDK7? This is no different from what we did for the adoption of JDK7. For a bit of time (2-3 releases?), we were able to run on both JDK6 and JDK7, and we phased out JDK6 only once most of the community stopped using it. Thanks, +Vinod

On Mar 2, 2015, at 8:08 PM, Andrew Wang andrew.w...@cloudera.com wrote:

Given that we already agreed to put JDK7 in 2.7, and that the classpath is a fairly minor irritant given some existing solutions (e.g. a new default classloader), how do you quantify the benefit for users? I looked at our thread on this topic from last time, and we (meaning at least myself and Tucu) agreed to a one-time exception to the JDK7 bump in 2.x for practical reasons. We waited for so long that we had some assurance JDK6 was on the outs. Multiple distros also had already bumped their min version to JDK7. This is not true this time around. Bumping the JDK version is hugely impactful on the end user, and my email on the earlier thread still reflects my thoughts on JDK compatibility: http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201406.mbox/%3CCAGB5D2a5fEDfBApQyER_zyhc8a4Xd_ea1wJSsxxkiAiDZO9%2BNg%40mail.gmail.com%3E
Right now, the incompatible changes would be JDK8, classpath isolation, and whatever is already in trunk. I can audit these existing trunk changes when branch-3 is cut.
Re: Looking to a Hadoop 3 release
The 'resistance' is not so much about a new major release, more about the content and the roadmap of the release. Other than the two specific features raised (the need for breaking compat for them is something that I am debating), I haven't seen a roadmap of branch-3 with any more features that this community needs to discuss. If all the difference between branch-2 and branch-3 is going to be the JDK plus a couple of incompatible changes, it is a big problem in two dimensions: (1) it's a burden keeping the branches in sync and avoiding the split-brain we experienced with 1.x/2.x, or worse, branch-0.23/branch-2, and (2) it is very hard to ask people to not break more things in branch-3. We seem to have agreed upon a course of action for JDK7, and now we are taking a different direction for JDK8. Going by this new proposal, come 2016 we will have to deal with JDK9 and three mainline incompatible Hadoop releases.

Regarding individual improvements like classpath isolation and the shell script stuff, Jason Lowe captured it perfectly on HADOOP-11656 - it should be possible for every major feature that we develop to be opt-in, unless the change is so great that users can balance out the incompatibilities against the new stuff they are getting. Even with a ground-breaking change like YARN, we spent a bit of time ensuring compatibility (MAPREDUCE-5108), and that has paid for itself many times over. Breaking compatibility shouldn't come across as too cheap a thing. Thanks, +Vinod

On Mar 4, 2015, at 10:15 AM, Andrew Wang andrew.w...@cloudera.com wrote:

Where does this resistance to a new major release stem from? As I've described from the beginning, this will look basically like a 2.x release, except for the inclusion of classpath isolation by default and a target version of JDK8. I've expressed my desire to maintain API and wire compatibility, and we can audit the set of incompatible changes in trunk to ensure this.
My proposal for doing alpha and beta releases leading up to GA also gives downstreams a nice amount of time for testing and validation.
Re: 2.7 status
The 2.7 blocker JIRA count went down and is going back up again; we will need to converge. Unless I see objections, I plan to cut a branch this weekend and selectively filter stuff in after that, in the interest of convergence. Thoughts welcome! Thanks, +Vinod

On Mar 1, 2015, at 11:58 AM, Arun Murthy a...@hortonworks.com wrote: Sounds good, thanks for the help Vinod! Arun

From: Vinod Kumar Vavilapalli Sent: Sunday, March 01, 2015 11:43 AM To: Hadoop Common; Jason Lowe; Arun Murthy Subject: Re: 2.7 status

Agreed. How about we roll an RC at the end of this week? As a Java 7+ release with the features and patches that already got in? Here's a filter tracking blocker tickets - https://issues.apache.org/jira/issues/?filter=12330598. Nine open now. +Arun. Arun, I'd like to help get 2.7 out without further delay. Do you mind me taking over release duties? Thanks, +Vinod

From: Jason Lowe jl...@yahoo-inc.com.INVALID Sent: Friday, February 13, 2015 8:11 AM To: common-...@hadoop.apache.org Subject: Re: 2.7 status

I'd like to see a 2.7 release sooner rather than later. It has been almost 3 months since Hadoop 2.6 was released, and there have already been 634 JIRAs committed to 2.7. That's a lot of changes waiting for an official release. https://issues.apache.org/jira/issues/?jql=project%20in%20%28hadoop%2Chdfs%2Cyarn%2Cmapreduce%29%20AND%20fixversion%3D2.7.0%20AND%20resolution%3DFixed Jason

From: Sangjin Lee sj...@apache.org To: common-...@hadoop.apache.org Sent: Tuesday, February 10, 2015 1:30 PM Subject: 2.7 status

Folks, What is the current status of the 2.7 release? I know initially it started out as a java-7-only release, but looking at the JIRAs that is very much not the case. Do we have a certain timeframe for 2.7, or is it time to discuss it? Thanks, Sangjin
Re: Looking to a Hadoop 3 release
I'm OK with a 3.0.0 release as long as we are minimizing the pain of maintaining yet another release line and are conscious of the incompatibilities going into that release line.

For the former, I would really rather not see a branch-3 cut so soon. It's yet another line onto which to cherry-pick, and I don't see why we need to add this overhead at such an early phase. We should only create branch-3 when there's an incompatible change that the community wants and it should _not_ go into the next major release (i.e.: it's for Hadoop 4.0). We can develop 3.0 alphas and betas on trunk and release from trunk in the interim. IMHO we need to stop treating trunk as a place to exile patches.

For the latter, I think as a community we need to evaluate the benefits of breaking compatibility against the costs of migrating. Each time we break compatibility we create a hurdle for people to jump when they move to the new release, and we should make those hurdles worth their time. For example, wire-compatibility has been mentioned as part of this. Any feature that breaks wire compatibility had better be absolutely amazing, as it creates a huge hurdle for people to jump.

To summarize:
+1 for a community-discussed roadmap of what we're breaking in Hadoop 3 and why it's worth it for users
-1 for creating branch-3 now; we can release from trunk until the next incompatibility for Hadoop 4 arrives
+1 for baking classpath isolation as opt-in on 2.x and eventually default-on in 3.0

Jason

From: Andrew Wang andrew.w...@cloudera.com To: hdfs-...@hadoop.apache.org Cc: common-...@hadoop.apache.org; mapreduce-dev@hadoop.apache.org; yarn-...@hadoop.apache.org Sent: Wednesday, March 4, 2015 12:15 PM Subject: Re: Looking to a Hadoop 3 release

Let's not dismiss this quite so handily.
Sean, Jason, and Stack replied on HADOOP-11656, pointing out that while we could make classpath isolation opt-in via configuration, what we really want longer term is to have it on by default (or just always on). Stack in particular points out the practical difficulties of using an opt-in method in 2.x from a downstream project perspective. It's not pretty. The plan that both Sean and Jason propose (which I support) is to have an opt-in solution in 2.x, bake it there, then turn it on by default (incompatible) in a new major release. I think this lines up well with my proposal of some alphas and betas leading up to a GA 3.x. I'm also willing to help with 2.x release management if that would help with testing this feature.

Even setting aside classpath isolation, a new major release is still justified by JDK8. Somehow this is being ignored in the discussion. Allen, historically the voice of the user in our community, just highlighted it as a major compatibility issue, and Tucu and I have also expressed our very strong concerns about bumping this in a minor release. 2.7's bump is a unique exception, and it is not something to be cited as precedent or policy.

Where does this resistance to a new major release stem from? As I've described from the beginning, this will look basically like a 2.x release, except for the inclusion of classpath isolation by default and a target version of JDK8. I've expressed my desire to maintain API and wire compatibility, and we can audit the set of incompatible changes in trunk to ensure this. My proposal for doing alpha and beta releases leading up to GA also gives downstreams a nice amount of time for testing and validation. Regards, Andrew

On Tue, Mar 3, 2015 at 2:32 PM, Arun Murthy a...@hortonworks.com wrote: Awesome, looks like we can just do this in a compatible manner - nothing else on the list seems like it warrants a (premature) major release. Thanks Vinod.
Arun

From: Vinod Kumar Vavilapalli vino...@hortonworks.com Sent: Tuesday, March 03, 2015 2:30 PM To: common-...@hadoop.apache.org Cc: hdfs-...@hadoop.apache.org; mapreduce-dev@hadoop.apache.org; yarn-...@hadoop.apache.org Subject: Re: Looking to a Hadoop 3 release

I started pitching in more on that JIRA. To add, I think we can and should strive for doing this in a compatible manner, whatever the approach. Marking and calling it incompatible before we see a proposal/patch seems premature to me. Commented the same on the JIRA: https://issues.apache.org/jira/browse/HADOOP-11656?focusedCommentId=14345875&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14345875 Thanks +Vinod

On Mar 2, 2015, at 8:08 PM, Andrew Wang andrew.w...@cloudera.com wrote:

Regarding classpath isolation, based on what I hear from our customers, it's still a big problem (even after the MR classloader work). The latest Jackson version bump was quite painful for our downstream projects, and the HDFS client still leaks a lot
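The "new default classloader" referenced in this thread amounts to parent-last (child-first) class loading, so application jars win over the platform's copies of shared dependencies. Below is a minimal, hypothetical sketch of that delegation order in plain Java - it is not the actual HADOOP-11656 design (no design doc existed at this point in the thread), just an illustration of the technique being discussed:

```java
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Sketch of "parent-last" (child-first) classloading: classes are looked up
 * in this loader's own URLs before delegating to the parent. Illustrative
 * only; not the HADOOP-11656 implementation.
 */
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Child first: search this loader's URLs before the parent.
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // Not in the app jars: fall back to the usual parent-first
                    // path (this also covers all java.* classes).
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no URLs of its own, everything still resolves via the parent.
        ChildFirstClassLoader cl =
            new ChildFirstClassLoader(new URL[0], ChildFirstClassLoader.class.getClassLoader());
        System.out.println(cl.loadClass("java.lang.String") == String.class); // true
    }
}
```

User jars passed in as URLs would be searched first; anything not found there still resolves through the parent, so the platform keeps control of the core runtime while the app sees its own Jackson, Guava, etc.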
Re: Looking to a Hadoop 3 release
On 05/03/2015 13:05, Alejandro Abdelnur tuc...@gmail.com wrote:

IMO, if part of the community wants to take on the responsibility and work that it takes to do a new major release, we should not discourage them from doing that. Having multiple major branches active is a standard practice.

Looking @ 2.x, the major work (HDFS HA, YARN) meant that it did take a long time to get out, and during that time 0.21 and 0.22 got released and ignored, while 0.23 was picked up and used in production. The 2.0.4-alpha release was more of a trouble spot, as it got picked up widely enough to be used in products, and changes were made between that alpha and 2.2 itself which raised compatibility issues.

For 3.x I'd propose:
1. Have less longevity of 3.x alpha/beta artifacts.
2. Make clear there are no guarantees of compatibility from alpha/beta releases to shipping. Best effort, but not to the extent that it gets in the way. More succinctly: we will care more about seamless migration from 2.2+ to 3.x than from a 3.0-alpha to 3.3 production.
3. Anybody who ships code based on 3.x alpha/beta is to recognise and accept policy (2): Hadoop's instability guarantee for the 3.x alpha/beta phase.

As well as backwards compatibility, we need to think about forwards compatibility, with the goal being: any app written/shipped with the 3.x release binaries (JAR and native) will work against a 3.y Hadoop release, for all x, y in Natural where y >= x and is-release(x) and is-release(y). That's important, as it means all server-side changes in 3.x which are expected to mandate client-side updates - protocols, HDFS erasure decoding, security features - must be considered complete and stable before we can say is-release(x). In an ideal world, we'll even get the semantics right with tests to show this.

Fixing classpath hell downstream is certainly one feature I am +1 on.
But: it's only one of the features, and given there's not any design doc on that JIRA, it's way too immature to set a release schedule on. An alpha schedule with no guarantees and a regular alpha roll could be viable, as new features go in and can then be used to experimentally try this stuff in branches of HBase (well volunteered, Stack!), etc. Of course, instability guarantees will be transitive downstream.

This time around we are not replacing the guts as we did from Hadoop 1 to Hadoop 2, but doing superficial surgery to address issues that were not considered (or were too much to take on top of the guts transplant). For the split-brain concern, we did a great job maintaining Hadoop 1 and Hadoop 2 until Hadoop 1 faded away.

And a significant argument about 2.0.4-alpha to 2.2 protobuf/HDFS compatibility.

Based on that experience I would say that the coexistence of Hadoop 2 and Hadoop 3 will be much less demanding/traumatic.

The re-layout of all the source trees was a major change there; assuming there's no refactoring or switch of build tools, then picking things back will be tractable.

Also, to facilitate the coexistence we should limit Java language features to Java 7 (even if the runtime is Java 8); once Java 7 is not used anymore we can remove this limitation.

+1; setting javac.version will fix this. What is nice about having Java 8 as the base JVM is that it means you can be confident that all Hadoop 3 servers will be JDK8+, so downstream apps and libs can use all the Java 8 features they want to. There's one policy change to consider there, which is possibly, just possibly, we could allow new modules in hadoop-tools to adopt Java 8 language features early, provided everyone recognised that backport to branch-2 isn't going to happen. -Steve
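Steve's `javac.version` point boils down to build configuration: compile on a JDK8 toolchain while pinning the language level to Java 7, so sources remain backportable to branch-2. The snippet below is a generic maven-compiler-plugin illustration of that idea, not Hadoop's actual pom:

```xml
<!-- Illustrative only: keep the language level at Java 7 even when
     building on JDK8, so branch-2 backports remain possible. -->
<properties>
  <javac.version>1.7</javac.version>
</properties>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>${javac.version}</source>
        <target>${javac.version}</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```

One caveat: -source/-target alone do not stop code from accidentally linking against JDK8-only library methods; catching that at compile time requires pointing the boot classpath at a Java 7 runtime as well.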
Re: Looking to a Hadoop 3 release
Sorry, Outlook dequoted Alejandro's comments. Let me try again with his comments in italic and proofreading of mine.

On 05/03/2015 13:59, Steve Loughran ste...@hortonworks.com wrote: On 05/03/2015 13:05, Alejandro Abdelnur tuc...@gmail.com wrote:

IMO, if part of the community wants to take on the responsibility and work that it takes to do a new major release, we should not discourage them from doing that. Having multiple major branches active is a standard practice.

Looking @ 2.x, the major work (HDFS HA, YARN) meant that it did take a long time to get out, and during that time 0.21 and 0.22 got released and ignored, while 0.23 was picked up and used in production. The 2.0.4-alpha release was more of a trouble spot, as it got picked up widely enough to be used in products, and changes were made between that alpha and 2.2 itself which raised compatibility issues.

For 3.x I'd propose:
1. Have less longevity of 3.x alpha/beta artifacts.
2. Make clear there are no guarantees of compatibility from alpha/beta releases to shipping. Best effort, but not to the extent that it gets in the way. More succinctly: we will care more about seamless migration from 2.2+ to 3.x than from a 3.0-alpha to 3.3 production.
3. Anybody who ships code based on 3.x alpha/beta is to recognise and accept policy (2): Hadoop's instability guarantee for the 3.x alpha/beta phase.

As well as backwards compatibility, we need to think about forwards compatibility, with the goal being: any app written/shipped with the 3.x release binaries (JAR and native) will work in and against a 3.y Hadoop cluster, for all x, y in Natural where y >= x and is-release(x) and is-release(y). That's important, as it means all server-side changes in 3.x which are expected to mandate client-side updates - protocols, HDFS erasure decoding, security features - must be considered complete and stable before we can say is-release(x).
In an ideal world, we'll even get the semantics right with tests to show this. Fixing classpath hell downstream is certainly one feature I am +1 on. But: it's only one of the features, and given there's not any design doc on that JIRA, it's way too immature to set a release schedule on. An alpha schedule with no guarantees and a regular alpha roll could be viable, as new features go in and can then be used to experimentally try this stuff in branches of HBase (well volunteered, Stack!), etc. Of course, instability guarantees will be transitive downstream.

This time around we are not replacing the guts as we did from Hadoop 1 to Hadoop 2, but doing superficial surgery to address issues that were not considered (or were too much to take on top of the guts transplant). For the split-brain concern, we did a great job maintaining Hadoop 1 and Hadoop 2 until Hadoop 1 faded away.

And a significant argument about 2.0.4-alpha to 2.2 protobuf/HDFS compatibility.

Based on that experience I would say that the coexistence of Hadoop 2 and Hadoop 3 will be much less demanding/traumatic.

The re-layout of all the source trees was a major change there; assuming there's no refactoring or switch of build tools, then picking things back will be tractable.

Also, to facilitate the coexistence we should limit Java language features to Java 7 (even if the runtime is Java 8); once Java 7 is not used anymore we can remove this limitation.

+1; setting javac.version will fix this. What is nice about having Java 8 as the base JVM is that it means you can be confident that all Hadoop 3 servers will be JDK8+, so downstream apps and libs can use all the Java 8 features they want to. There's one policy change to consider there, which is possibly, just possibly, we could allow new modules in hadoop-tools to adopt Java 8 language features early, provided everyone recognised that backport to branch-2 isn't going to happen. -Steve
Re: Looking to a Hadoop 3 release
I think it'll be useful to have a discussion about what else people would like to see in Hadoop 3.x - especially if the change is potentially incompatible. Also, what do we expect the release schedule to be for major releases, and what triggers them - JVM version, major features, the need for incompatible changes? Assuming major versions will not be released every 6 months/1 year (adoption time; fairly disruptive for downstream projects and users), considering additional features/incompatible changes for 3.x would be useful. Some features that come to mind immediately would be:

1) Enhancements to the RPC mechanics - specifically support for async RPC / two-way communication. There are a lot of places where we re-use heartbeats to send more information than would be sent if the RPC layer supported these features. Some of this can be done in a manner compatible with the existing RPC sub-system; others, like two-way communication, probably cannot. After this, having HDFS/YARN actually make use of these changes. The other consideration is adoption of an alternate system like gRPC, which would be incompatible.

2) Simplification of configs - potentially separating client-side configs from those used by daemons. This is another source of perpetual confusion for users.

Thanks - Sid

On Thu, Mar 5, 2015 at 2:46 PM, Steve Loughran ste...@hortonworks.com wrote: Sorry, Outlook dequoted Alejandro's comments. Let me try again with his comments in italic and proofreading of mine. On 05/03/2015 13:59, Steve Loughran ste...@hortonworks.com wrote: On 05/03/2015 13:05, Alejandro Abdelnur tuc...@gmail.com wrote: IMO, if part of the community wants to take on the responsibility and work that it takes to do a new major release, we should not discourage them from doing that. Having multiple major branches active is a standard practice.
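Sid's first item - async RPC instead of piggybacking commands on heartbeats - can be pictured as a future-based client API. Everything below (the `AsyncRpcSketch` class, the `call` method, the simulated out-of-band server reply) is a hypothetical illustration, not Hadoop's RPC engine or any proposed interface:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative only: how an async RPC client API might look, versus
 * waiting for the next periodic heartbeat to carry the payload.
 */
public class AsyncRpcSketch {
    // A hypothetical async call: returns immediately with a future instead
    // of blocking the caller until the server responds.
    static CompletableFuture<String> call(String request) {
        ScheduledExecutorService io = Executors.newSingleThreadScheduledExecutor();
        CompletableFuture<String> response = new CompletableFuture<>();
        // Simulate the server pushing a reply a little later, out of band.
        io.schedule(() -> {
            response.complete("ack:" + request);
            io.shutdown();
        }, 10, TimeUnit.MILLISECONDS);
        return response;
    }

    public static void main(String[] args) throws Exception {
        // The caller attaches continuations (or waits) on the future,
        // rather than decoding extra fields piggybacked on a heartbeat.
        String reply = call("launchContainer").get(1, TimeUnit.SECONDS);
        System.out.println(reply); // ack:launchContainer
    }
}
```

The point of the sketch: the response arrives when the server pushes it, not on the next heartbeat interval, and continuations via thenApply/thenAccept replace hand-rolled heartbeat-payload decoding.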
Re: Reviving HADOOP-7435: Making Jenkins pre-commit build work with branches
Tx for the feedback! Let's continue on the JIRA, but I'd definitely welcome as much help as is available. Thanks, +Vinod

On Mar 4, 2015, at 3:30 PM, Zhijie Shen zs...@hortonworks.com wrote: +1. It's really helpful for branch development. To continue Karthik's point, is it good to make pre-commit testing against branch-2 the default too, like that against trunk?

On 3/4/15, 1:47 PM, Sean Busbey bus...@cloudera.com wrote: +1. If we can make things look like HBase's support for precommit testing on branches (HBASE-12944), that would make it easier for new and occasional contributors who might end up working in other ecosystem projects. AFAICT, Jonathan's proposal for branch names in patch names does this.

On Wed, Mar 4, 2015 at 3:41 PM, Karthik Kambatla ka...@cloudera.com wrote: Thanks for reviving this on email, Vinod. Newer folks like me might not be aware of this JIRA/effort. This would be wonderful to have so (1) we know the status of release branches (branch-2, etc.) and also (2) feature branches (YARN-2928). Jonathan's or Matt's proposal for including the branch name looks reasonable to me. If no one has any objections, I think we can continue on the JIRA and get this in.

On Wed, Mar 4, 2015 at 1:20 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Hi all, I'd like us to revive the effort at https://issues.apache.org/jira/browse/HADOOP-7435 to make precommit builds able to work with branches. Having Jenkins verify patches on branches is very useful, even if there may be relaxed review oversight on the said branch. Unless there are objections, I'd request help from Giri, who already has a patch sitting there for more than a year. This may need us to collectively agree on some convention - the last comment says that the branch patch name should be in some format for this to work. Thanks, +Vinod

-- Karthik Kambatla Software Engineer, Cloudera Inc. http://five.sentenc.es -- Sean
Re: Looking to a Hadoop 3 release
If classloader isolation is in place, then dependency versions can freely be upgraded, as they won't pollute the apps' space (things get trickier if there is an ON/OFF switch).

On Thu, Mar 5, 2015 at 9:21 PM, Allen Wittenauer a...@altiscale.com wrote: Is there going to be a general upgrade of dependencies? I'm thinking of jetty and jackson in particular.

On Mar 5, 2015, at 5:24 PM, Andrew Wang andrew.w...@cloudera.com wrote:

I've taken the liberty of adding a Hadoop 3 section to the Roadmap wiki page. In addition to the two things I've been pushing, I also looked through Allen's list (thanks Allen for making this) and picked out the shell script rewrite and the removal of HFTP as big changes. This would be the place to propose features for inclusion in 3.x; I'd particularly appreciate help on the YARN/MR side. Based on what I'm hearing, let me modulate my proposal to the following:

- We avoid cutting branch-3 and release off of trunk. The trunk-only changes don't look that scary, so I think this is fine. This does mean we need to be more rigorous before merging branches to trunk. I think Vinod/Giri's work on getting test-patch.sh runs on non-trunk branches would be very helpful in this regard.
- We do not include anything that breaks wire compatibility unless (as Jason says) it's an unbelievably awesome feature.
- No harm in rolling alphas from trunk, as it doesn't lock us into anything compatibility-wise. Downstreams like releases.

I'll take Steve's advice about not locking GA to a given date, but I also share his belief that we can alpha/beta/GA faster than it took for Hadoop 2. Let's roll some intermediate releases, work on the roadmap items, and see how we're feeling in a few months. Best, Andrew

On Thu, Mar 5, 2015 at 3:21 PM, Siddharth Seth ss...@apache.org wrote: I think it'll be useful to have a discussion about what else people would like to see in Hadoop 3.x - especially if the change is potentially incompatible.
Also, what we expect the release schedule to be for major releases and what triggers them - JVM version, major features, the need for incompatible changes ? Assuming major versions will not be released every 6 months/1 year (adoption time, fairly disruptive for downstream projects, and users) - considering additional features/incompatible changes for 3.x would be useful. Some features that come to mind immediately would be 1) enhancements to the RPC mechanics - specifically support for AsynRPC / two way communication. There's a lot of places where we re-use heartbeats to send more information than what would be done if the PRC layer supported these features. Some of this can be done in a compatible manner to the existing RPC sub-system. Others like 2 way communication probably cannot. After this, having HDFS/YARN actually make use of these changes. The other consideration is adoption of an alternate system ike gRpc which would be incompatible. 2) Simplification of configs - potentially separating client side configs and those used by daemons. This is another source of perpetual confusion for users. Thanks - Sid On Thu, Mar 5, 2015 at 2:46 PM, Steve Loughran ste...@hortonworks.com wrote: Sorry, outlook dequoted Alejandros's comments. Let me try again with his comments in italic and proofreading of mine On 05/03/2015 13:59, Steve Loughran ste...@hortonworks.commailto: ste...@hortonworks.com wrote: On 05/03/2015 13:05, Alejandro Abdelnur tuc...@gmail.commailto: tuc...@gmail.commailto:tuc...@gmail.com wrote: IMO, if part of the community wants to take on the responsibility and work that takes to do a new major release, we should not discourage them from doing that. Having multiple major branches active is a standard practice. Looking @ 2.x, the major work (HDFS HA, YARN) meant that it did take a long time to get out, and during that time 0.21, 0.22, got released and ignored; 0.23 picked up and used in production. 
The 2.0.4-alpha release was more of a trouble spot, as it got picked up widely enough to be used in products, and changes were made between that alpha and 2.2 itself which raised compatibility issues. For 3.x I'd propose 1. Have less longevity of 3.x alpha/beta artifacts 2. Make clear there are no guarantees of compatibility from alpha/beta releases to shipping. Best effort, but not to the extent that it gets in the way. More succinctly: we will care more about seamless migration from 2.2+ to 3.x than from a 3.0-alpha to 3.3 production. 3. Anybody who ships code based on 3.x alpha/beta is to recognise and accept policy (2). Hadoop's instability guarantee for the 3.x alpha/beta phase As well as backwards compatibility, we need to think about forwards compatibility, with the goal being: Any app written/shipped with the 3.x release binaries (JAR and native) will work in and
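Sid's first item above - async RPC / two-way communication instead of piggybacking information onto heartbeats - can be sketched in a few lines. This is a hedged illustration only: the class and method names (AsyncRpcSketch, asyncCall, "registerNode") are invented for the example and are not Hadoop or gRPC APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncRpcSketch {
    // With a synchronous heartbeat, a daemon only learns new information
    // when it polls. An async call lets it park a request and keep working;
    // the "server" here is just an executor standing in for a remote peer.
    static CompletableFuture<String> asyncCall(ExecutorService server, String request) {
        return CompletableFuture.supplyAsync(() -> "response-to-" + request, server);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService server = Executors.newSingleThreadExecutor();
        CompletableFuture<String> pending = asyncCall(server, "registerNode");
        // The caller's thread is free here, rather than blocked until the
        // next heartbeat interval.
        String reply = pending.get();
        System.out.println(reply);
        server.shutdown();
    }
}
```

The compatible variant Sid mentions (enriching the existing request/response layer) fits this shape; true server-initiated two-way communication would need a channel the current RPC protocol does not define, which is where the incompatibility comes in.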
Re: Looking to a Hadoop 3 release
Thanks all. There is an open issue, HDFS-6962 (ACLs inheritance conflicts with umaskmode), for which the incompatibility appears to make it not suitable for 2.x, so it's targeted for 3.0; please see: https://issues.apache.org/jira/browse/HDFS-6962?focusedCommentId=14335418&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14335418 Best, --Yongjun On Wed, Mar 4, 2015 at 8:13 PM, Allen Wittenauer a...@altiscale.com wrote: One of the questions that keeps popping up is “what exactly is in trunk?” As some may recall, I had done some experiments creating the change log based upon JIRA. While the interest level appeared to be approaching zero, I kept playing with it a bit and eventually also started playing with the release notes script (for various reasons I won’t bore you with.) In any case, I’ve started posting the results of these runs on one of my github repos if anyone was wanting a quick reference as to JIRA’s opinion on the matter: https://github.com/aw-altiscale/hadoop-release-metadata/tree/master/3.0.0
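For readers unfamiliar with HDFS-6962: the conflict is between two POSIX rules. Normally a new file's mode is (requested & ~umask), but when the parent directory carries a default ACL, the ACL - not the umask - is supposed to constrain the child; applying both is the bug, and changing it is what's incompatible. A hedged sketch of the arithmetic (class and numbers are illustrative, not the HDFS patch):

```java
public class UmaskInheritanceSketch {
    // Plain POSIX create: the umask strips permission bits from the
    // client's requested mode.
    static int applyUmask(int requestedMode, int umask) {
        return requestedMode & ~umask;
    }

    public static void main(String[] args) {
        int requested = 0666;  // rw-rw-rw- requested by the client
        int umask = 022;       // a typical default umask
        // No default ACL on the parent: umask applies.
        System.out.printf("plain create: %o%n", applyUmask(requested, umask));
        // Default ACL on the parent: per POSIX the ACL governs instead,
        // so the requested mode should pass through unmasked.
        System.out.printf("with default ACL: %o%n", requested);
    }
}
```

The first line prints 644; the second, 666 - the gap between those two results is exactly the behavior change that makes the fix unsuitable for a 2.x minor release.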
Re: Looking to a Hadoop 3 release
On Mon, Mar 2, 2015 at 11:04 PM, Konstantin Shvachko shv.had...@gmail.com wrote: 2. If Hadoop 3 and 2.x are meant to exist together, we run the risk of manifesting split-brain behavior again, as we had with hadoop-1, hadoop-2 and other versions. If that is somehow beneficial for commercial vendors, which I don't see how, for the community it proved to be very disruptive. Would be really good to avoid it this time. Agreed; let's try to minimize backporting headaches. Pulling trunk -> branch-2 -> branch-2.x is already tedious. Adding a branch-3, branch-3.x would be obnoxious. 3. Could we release Hadoop 3 directly from trunk? With a proper feature freeze in advance. Current trunk is in the best working condition I've seen in years - much better than when hadoop-2 was coming to life. It could make a good alpha. +1 This sounds like a good approach. Marked as alpha, we can break compatibility in minor versions. Stabilizing a beta can correspond with cutting branch-3, since that will be winding down branch-2. This shouldn't disrupt existing plans for branch-2. However, this requires that committers not accumulate too much compatibility debt in trunk. Undoing all that in branch-3 imposes a burdensome tax. Scanning through Allen's diff: that doesn't appear to be the case so far, but it recommends against developing features in place on trunk. Just be considerate of users and developers who will need to move from (and maintain) branch-2. I believe we can start planning 3.0 from trunk right after 2.7 is out. If we're publishing a snapshot, we don't need too much planning. -C On Mon, Mar 2, 2015 at 3:19 PM, Andrew Wang andrew.w...@cloudera.com wrote: Hi devs, It's been a year and a half since 2.x went GA, and I think we're about due for a 3.x release. Notably, there are two incompatible changes I'd like to call out that will have a tremendous positive impact for our users.
First, classpath isolation, being done at HADOOP-11656, which has been a long-standing request from many downstreams and Hadoop users. Second, bumping the source and target JDK version to JDK8 (related to HADOOP-11090), which is important since JDK7 is EOL in April 2015 (two months from now). In the past, we've had issues with our dependencies discontinuing support for old JDKs, so this will future-proof us. Between the two, we'll also have quite an opportunity to clean up and upgrade our dependencies, another common user and developer request. I'd like to propose that we start rolling a monthly-ish series of 3.0 alpha releases ASAP, with myself volunteering to take on the RM and other cat-herding responsibilities. There are already quite a few changes slated for 3.0 besides the above (for instance the shell script rewrite), so there's already value in a 3.0 alpha, and the more time we give downstreams to integrate, the better. This opens up discussion about inclusion of other changes, but I'm hoping to freeze incompatible changes after maybe two alphas, do a beta (with no further incompat changes allowed), and then finally a 3.x GA. For those keeping track, that means a 3.x GA in about four months. I would also like to stress, though, that this is not intended to be a big-bang release. For instance, it would be great if we could maintain wire compatibility between 2.x and 3.x, so rolling upgrades work. Keeping branch-2 and branch-3 similar also makes backports easier, since we're likely maintaining 2.x for a while yet. Please let me know any comments/concerns related to the above. If people are friendly to the idea, I'd like to cut a branch-3 and start working on the first alpha. Best, Andrew
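The classpath-isolation idea behind HADOOP-11656 is usually implemented with "child-first" (parent-last) classloading: application jars are resolved before the framework's own dependencies, so an app can ship a different Jackson or Guava than Hadoop itself uses. The sketch below illustrates the technique only; it is not the HADOOP-11656 code, and a real implementation would always delegate java.* and other protected packages to the parent.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name);              // try the app's own jars first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false); // fall back to the framework loader
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws Exception {
        // With no app jars registered, everything falls through to the parent.
        ChildFirstClassLoader cl = new ChildFirstClassLoader(
                new URL[0], ChildFirstClassLoader.class.getClassLoader());
        System.out.println(cl.loadClass("java.lang.String").getName());
    }
}
```

This is also why Alejandro's point upthread matters: if isolation is unconditionally on, the framework's dependency versions become invisible to apps and can be upgraded freely, whereas an ON/OFF switch means both behaviors must keep working.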
Hadoop-Mapreduce-trunk-Java8 - Build # 123 - Still Failing
See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/123/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 10611 lines...] Tests run: 521, Failures: 2, Errors: 0, Skipped: 11 [INFO] [INFO] Reactor Summary: [INFO] [INFO] hadoop-mapreduce-client ... SUCCESS [ 2.031 s] [INFO] hadoop-mapreduce-client-core .. SUCCESS [01:13 min] [INFO] hadoop-mapreduce-client-common SUCCESS [ 24.528 s] [INFO] hadoop-mapreduce-client-shuffle ... SUCCESS [ 4.009 s] [INFO] hadoop-mapreduce-client-app ... SUCCESS [10:15 min] [INFO] hadoop-mapreduce-client-hs SUCCESS [06:06 min] [INFO] hadoop-mapreduce-client-jobclient . FAILURE [ 01:58 h] [INFO] hadoop-mapreduce-client-hs-plugins SKIPPED [INFO] hadoop-mapreduce-client-nativetask SKIPPED [INFO] Apache Hadoop MapReduce Examples .. SKIPPED [INFO] hadoop-mapreduce .. SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 02:16 h [INFO] Finished at: 2015-03-05T15:24:38+00:00 [INFO] Final Memory: 34M/167M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-mapreduce-client-jobclient: There was a timeout or other error in the fork - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :hadoop-mapreduce-client-jobclient Build step 'Execute shell' marked build as failure [FINDBUGS] Skipping publisher since build result is FAILURE Archiving artifacts Sending artifact delta relative to Hadoop-Mapreduce-trunk-Java8 #28 Archived 1 artifacts Archive block size is 32768 Received 0 blocks and 20322434 bytes Compression is 0.0% Took 5 min 39 sec Recording test results Updating YARN-3242 Updating HDFS-7434 Updating HDFS-7879 Updating HADOOP-11648 Updating MAPREDUCE-6267 Updating YARN-3249 Updating MAPREDUCE-6136 Updating HADOOP-11674 Updating HDFS-7746 Updating HDFS-7535 Updating YARN-3131 Updating YARN-3122 Updating YARN-3231 Updating HDFS-1522 Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 2 tests failed.
FAILED: org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat.testSplitPlacementForCompressedFiles Error Message: expected:<2> but was:<1> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<1> at junit.framework.Assert.fail(Assert.java:57) at junit.framework.Assert.failNotEquals(Assert.java:329) at junit.framework.Assert.assertEquals(Assert.java:78) at junit.framework.Assert.assertEquals(Assert.java:234) at junit.framework.Assert.assertEquals(Assert.java:241) at junit.framework.TestCase.assertEquals(TestCase.java:409) at org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat.testSplitPlacementForCompressedFiles(TestCombineFileInputFormat.java:911) FAILED: org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat.testSplitPlacement Error Message: expected:<2> but was:<1> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<1> at junit.framework.Assert.fail(Assert.java:57) at junit.framework.Assert.failNotEquals(Assert.java:329) at junit.framework.Assert.assertEquals(Assert.java:78) at junit.framework.Assert.assertEquals(Assert.java:234) at junit.framework.Assert.assertEquals(Assert.java:241) at junit.framework.TestCase.assertEquals(TestCase.java:409) at org.apache.hadoop.mapreduce.lib.input.TestCombineFileInputFormat.testSplitPlacement(TestCombineFileInputFormat.java:368)
Hadoop-Mapreduce-trunk - Build # 2073 - Failure
See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2073/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 31198 lines...] Tests run: 520, Failures: 0, Errors: 1, Skipped: 11 [INFO] [INFO] Reactor Summary: [INFO] [INFO] hadoop-mapreduce-client ... SUCCESS [ 2.659 s] [INFO] hadoop-mapreduce-client-core .. SUCCESS [01:32 min] [INFO] hadoop-mapreduce-client-common SUCCESS [ 27.780 s] [INFO] hadoop-mapreduce-client-shuffle ... SUCCESS [ 4.589 s] [INFO] hadoop-mapreduce-client-app ... SUCCESS [11:21 min] [INFO] hadoop-mapreduce-client-hs SUCCESS [05:37 min] [INFO] hadoop-mapreduce-client-jobclient . FAILURE [ 01:58 h] [INFO] hadoop-mapreduce-client-hs-plugins SKIPPED [INFO] hadoop-mapreduce-client-nativetask SKIPPED [INFO] Apache Hadoop MapReduce Examples .. SKIPPED [INFO] hadoop-mapreduce .. SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 02:17 h [INFO] Finished at: 2015-03-05T15:44:04+00:00 [INFO] Final Memory: 34M/760M [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.17:test (default-test) on project hadoop-mapreduce-client-jobclient: There was a timeout or other error in the fork - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :hadoop-mapreduce-client-jobclient Build step 'Execute shell' marked build as failure [FINDBUGS] Skipping publisher since build result is FAILURE Archiving artifacts Sending artifact delta relative to Hadoop-Mapreduce-trunk #2072 Archived 1 artifacts Archive block size is 32768 Received 0 blocks and 20312860 bytes Compression is 0.0% Took 5 min 11 sec Recording test results Updating YARN-3242 Updating HDFS-7434 Updating HDFS-7879 Updating HADOOP-11648 Updating MAPREDUCE-6267 Updating YARN-3249 Updating MAPREDUCE-6136 Updating HADOOP-11674 Updating HDFS-7746 Updating HDFS-7535 Updating YARN-3131 Updating YARN-3122 Updating YARN-3231 Updating HDFS-1522 Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## 1 tests failed. REGRESSION: org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMapreduceJobTimelineServiceEnabled Error Message: Job didn't finish in 30 seconds Stack Trace: java.io.IOException: Job didn't finish in 30 seconds at org.apache.hadoop.mapred.UtilsForTests.runJobSucceed(UtilsForTests.java:622) at org.apache.hadoop.mapred.TestMRTimelineEventHandling.testMapreduceJobTimelineServiceEnabled(TestMRTimelineEventHandling.java:155)
Re: Looking to a Hadoop 3 release
I've taken the liberty of adding a Hadoop 3 section to the Roadmap wiki page. In addition to the two things I've been pushing, I also looked through Allen's list (thanks Allen for making this) and picked out the shell script rewrite and the removal of HFTP as big changes. This would be the place to propose features for inclusion in 3.x; I'd particularly appreciate help on the YARN/MR side. Based on what I'm hearing, let me modulate my proposal to the following: - We avoid cutting branch-3, and release off of trunk. The trunk-only changes don't look that scary, so I think this is fine. This does mean we need to be more rigorous before merging branches to trunk. I think Vinod/Giri's work on getting test-patch.sh runs on non-trunk branches would be very helpful in this regard. - We do not include anything to break wire compatibility unless (as Jason says) it's an unbelievably awesome feature. - No harm in rolling alphas from trunk, as it doesn't lock us to anything compatibility-wise. Downstreams like releases. I'll take Steve's advice about not locking GA to a given date, but I also share his belief that we can alpha/beta/GA faster than it took for Hadoop 2. Let's roll some intermediate releases, work on the roadmap items, and see how we're feeling in a few months. Best, Andrew On Thu, Mar 5, 2015 at 3:21 PM, Siddharth Seth ss...@apache.org wrote: I think it'll be useful to have a discussion about what else people would like to see in Hadoop 3.x - especially if the change is potentially incompatible. Also, what do we expect the release schedule to be for major releases, and what triggers them - JVM version, major features, the need for incompatible changes? Assuming major versions will not be released every 6 months/1 year (adoption time, fairly disruptive for downstream projects, and users) - considering additional features/incompatible changes for 3.x would be useful.
Some features that come to mind immediately would be 1) enhancements to the RPC mechanics - specifically support for async RPC / two-way communication. There are a lot of places where we re-use heartbeats to send more information than we would if the RPC layer supported these features. Some of this can be done in a compatible manner with the existing RPC sub-system. Others, like two-way communication, probably cannot. After this, having HDFS/YARN actually make use of these changes. The other consideration is adoption of an alternate system like gRPC, which would be incompatible. 2) Simplification of configs - potentially separating client-side configs and those used by daemons. This is another source of perpetual confusion for users. Thanks - Sid On Thu, Mar 5, 2015 at 2:46 PM, Steve Loughran ste...@hortonworks.com wrote: Sorry, Outlook dequoted Alejandro's comments. Let me try again with his comments in italic and proofreading of mine On 05/03/2015 13:59, Steve Loughran ste...@hortonworks.com wrote: On 05/03/2015 13:05, Alejandro Abdelnur tuc...@gmail.com wrote: IMO, if part of the community wants to take on the responsibility and work that it takes to do a new major release, we should not discourage them from doing that. Having multiple major branches active is a standard practice. Looking @ 2.x, the major work (HDFS HA, YARN) meant that it did take a long time to get out, and during that time 0.21 and 0.22 got released and ignored; 0.23 got picked up and used in production. The 2.0.4-alpha release was more of a trouble spot, as it got picked up widely enough to be used in products, and changes were made between that alpha and 2.2 itself which raised compatibility issues.
More succinctly: we will care more about seamless migration from 2.2+ to 3.x than from a 3.0-alpha to 3.3 production. 3. Anybody who ships code based on 3.x alpha/beta is to recognise and accept policy (2). Hadoop's instability guarantee for the 3.x alpha/beta phase As well as backwards compatibility, we need to think about forwards compatibility, with the goal being: Any app written/shipped with the 3.x release binaries (JAR and native) will work in and against a 3.y Hadoop cluster, for all x, y in Natural where y >= x and is-release(x) and is-release(y) That's important, as it means all server-side changes in 3.x which are expected to mandate client-side updates: protocols, HDFS erasure decoding, security features, must be considered complete and stable before we can say is-release(x). In an ideal world, we'll even get the semantics right with tests to show this. Fixing classpath hell downstream is certainly one feature I am +1 on. But: it's
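Steve's forwards-compatibility goal can be written down as a small predicate. This is a hedged restatement for illustration only - the class, method, and parameter names are invented, not project code:

```java
public class ForwardCompatRule {
    // A client built against 3.x must interoperate with a cluster running
    // 3.y whenever y >= x and both x and y are actual releases; alphas and
    // betas carry no such guarantee (Steve's "instability guarantee").
    static boolean mustInterop(int x, int y, boolean xIsRelease, boolean yIsRelease) {
        return xIsRelease && yIsRelease && y >= x;
    }

    public static void main(String[] args) {
        System.out.println(mustInterop(0, 3, true, true));   // 3.0 client, 3.3 cluster: guaranteed
        System.out.println(mustInterop(3, 0, true, true));   // 3.3 client, 3.0 cluster: not guaranteed
        System.out.println(mustInterop(0, 3, false, true));  // 3.0-alpha client: not guaranteed
    }
}
```

The asymmetry is the point: new clients on old clusters are out of scope, which is why server-side features that force client updates must be finished before a version can count as a release.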