Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-18 Thread Vinayakumar B
+1, I agree with the support for git-merge based workflows for large branch
merge.

I have experienced the pain of rebasing the entire HDFS-7285 branch (just
for verification, though), and I found that even a one-line change in trunk
in core files (e.g. FSNamesystem.java, BlockManager.java) makes it hard to
rebase many commits in the branch.
  One main problem with git-rebase, as I have experienced it, is this:
  if we need to retain the same commits, all conflicts must be resolved by
the same person who is doing the rebase, since 'git-rebase' has to be executed
on a single machine. There is a fair chance of mishandling conflicts and
causing problems, and the person doing the rebase may not be very familiar
with the conflicting code.
  In this kind of situation, I think it's very hard to find out what was
the original code and what came from conflict resolution once the rebase is
done.

IMO, it's fair to go with periodic merges from trunk into the branch. Even
though there will be a few conflicts, these are unlikely to be as problematic
as rebase conflicts.
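
A minimal sketch of the periodic trunk-into-branch merge described above,
run in a throwaway repository. The file names, the branch name "feature" and
the committer identity are placeholders for illustration, not anything from
the Hadoop repo. The point it shows is that the existing feature commit keeps
its ID and the trunk changes arrive as one merge commit.

```shell
#!/bin/sh
set -eu

# Throwaway repository so nothing real is touched (all names are hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk      # call the initial branch "trunk"
git config user.name "Example Dev"
git config user.email "dev@example.invalid"

echo "base" > core.txt
git add core.txt
git commit -q -m "trunk: base"

# Work on the feature branch.
git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "feature: first change"
feature_commit=$(git rev-parse HEAD)

# Meanwhile trunk moves on.
git checkout -q trunk
echo "trunk fix" >> core.txt
git commit -q -a -m "trunk: fix in a core file"

# Periodic uprev: merge trunk into the feature branch.
git checkout -q feature
git merge -q --no-edit trunk

git --no-pager log --oneline --graph
```

If the merge had conflicted, they would all be resolved once, in that single
merge commit, which is also the only new commit a reviewer would need to read.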

   Regarding merging to branch-2: it needs a few more conflict resolutions
than trunk, but probably not too many, as trunk and branch-2 are moving in
parallel, at least in terms of features and fixes (~90%, I would say).

Regards,
Vinay

Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-18 Thread Andrew Wang
Sounds like we have a lot of support for also allowing merge workflows. Let
me draft a proper proposal and go through the [DISCUSS] and [VOTE] process.
One thing I think we should amend from the previous [VOTE] is using git
merge --no-ff rather than rebase --onto for branch -> trunk integration,
since it makes reverting the branch easier. Also using git merge rather
than a squashed commit for the branch-2 backport, as Vinay said.
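
A rough sketch of why a --no-ff merge makes the branch easy to revert, using
a throwaway repository. The branch name "feature", the file names and the
identity are made up for the example. Because --no-ff always records a merge
commit, the whole branch can be backed out by reverting that one commit
against its first parent.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is a placeholder.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk      # call the initial branch "trunk"
git config user.name "Example Dev"
git config user.email "dev@example.invalid"

echo "base" > core.txt
git add core.txt
git commit -q -m "trunk: base"

# Two commits on the feature branch.
git checkout -q -b feature
echo "one" > feature.txt
git add feature.txt
git commit -q -m "feature: part 1"
echo "two" >> feature.txt
git commit -q -a -m "feature: part 2"

# Integrate into trunk. A fast-forward would be possible here,
# but --no-ff forces a merge commit anyway.
git checkout -q trunk
git merge -q --no-ff --no-edit feature
merge_commit=$(git rev-parse HEAD)

# Back out the entire branch by reverting the merge commit,
# keeping the first parent (the trunk side) as the mainline.
git revert --no-edit -m 1 "$merge_commit" >/dev/null

git --no-pager log --oneline --graph
```

With a rebase --onto integration the same back-out would mean reverting each
of the branch's commits individually.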

In the meantime, I think it's okay for ongoing feature branch development
like HDFS-7285 to start using merge rather than rebase. Haven't seen any
objections to merge yet.

Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-18 Thread Sangjin Lee
One other (long shot) option might be to do git cherry-picks of all new
*trunk* commits into the feature branch when you uprev. But I'm not sure if
that will be a sustainable practice, given the number of commits that are
happening on the trunk. Unless you're upreving very often (e.g. daily),
this could also get out of hand.

Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-17 Thread Sangjin Lee
I also think allowing merges as a way to uprev with trunk would be a good
idea. AFAIK, git rebase works well when your branch is short-lived and
contains a fairly small number of commits, but doesn't work so well if your
branch is large. Also, the cost of rebase will only go up as time goes on. On
the other hand, git merge has a pretty decent chance to succeed, especially
if you merge the trunk often. My 2 cents.

Sangjin

Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-17 Thread Steve Loughran
I haven't done a big piece of work in the ASF code repo since the migration to 
git, though I did in the svn era.


Currently with private git repos
-anyone gets SCM control of their source
-you can commit for your own reasons (about to make a change, want a private 
jenkins run, ...) and gain from having many small checkins. More succinctly: if 
you aren't checking in your work 2+ times a day —why not?
-rebasing is a painful necessity on personal, private branches to keep the final 
patch to hadoop git a single diff

With the private git process that's the de facto standard, we lose history 
anyway. I know what I've done, and somewhere there's a tag in my own github repo 
of my work to create a JIRA. But we don't always need that entire history of 
trying to debug kerberos, a typo in an exception, and other stuff that accrues 
during the work.

I think therefore that I'm in favour of big squash commits. What we could do is 
extend that with a policy of


  1.  Tag the final commit used to make the patch, something like 
tag_HADOOP-8192. The tag ensures that the history isn't gc'd.
  2.  Delete the branch (keeps the number of branches down).
  3.  In the JIRA, include the name of the tag and the git commit number in the 
comments. Someone curious can rebuild that history.
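
The three steps above could look roughly like this in a throwaway repository.
The tag name tag_HADOOP-8192 comes from the text; the branch name, file
names, commit messages and identity are invented for the example, and the
squash merge stands in for "the patch" as one possible way to produce a
single commit. Pushing the tag somewhere others can fetch it is left out.

```shell
#!/bin/sh
set -eu

# Throwaway repository; names other than the tag are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk      # call the initial branch "trunk"
git config user.name "Example Dev"
git config user.email "dev@example.invalid"

echo "base" > core.txt
git add core.txt
git commit -q -m "trunk: base"

# Many small private check-ins on a work branch.
git checkout -q -b HADOOP-8192
for step in 1 2 3; do
  echo "attempt $step" >> work.txt
  git add work.txt
  git commit -q -m "wip: attempt $step"
done
final=$(git rev-parse HEAD)

# Land the work on trunk as one squashed commit.
git checkout -q trunk
git merge --squash HADOOP-8192 >/dev/null
git commit -q -m "HADOOP-8192. Example squashed change."

# 1. Tag the final commit so the detailed history stays reachable.
git tag tag_HADOOP-8192 "$final"
# 2. Delete the branch.
git branch -D HADOOP-8192 >/dev/null
# 3. Record the tag name and commit ID, e.g. in a JIRA comment.
echo "tag=tag_HADOOP-8192 commit=$(git rev-parse tag_HADOOP-8192)"

# Someone curious can still walk the full history through the tag.
git --no-pager log --oneline tag_HADOOP-8192
```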



Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-17 Thread Sangjin Lee
Just to be clear, are we discussing the process of uprev'ing the feature
development branch with the latest from the trunk from time to time, or
making the final merge of the feature branch onto the trunk?

Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-17 Thread Andrew Wang
@Sangjin,

I believe this is covered by the [VOTE] I linked to above, key excerpt
being:

   3. Force-push on feature-branches is allowed. Before pulling in a
   feature, the feature-branch should be rebased on latest trunk and the
   changes applied to trunk through git rebase --onto or git cherry-pick
   commit-range.

This specifies that the last uprev and final integration of the branch
into trunk happen with rebase. It doesn't say anything about the
periodic uprevs, but it'd be very strange to merge periodically and
then rebase once at the end. So I take it to mean doing periodic
uprevs with rebase too.
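
For reference, a small sketch of the rebase-then-cherry-pick flow the excerpt
describes, in a throwaway repository. The branch name "feature", file names
and identity are placeholders, and only the cherry-pick variant is shown. It
illustrates two things: the rebase gives the feature commits new IDs (hence
the force-push), and trunk ends up with a linear history and no merge commit.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is a placeholder.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk      # call the initial branch "trunk"
git config user.name "Example Dev"
git config user.email "dev@example.invalid"

echo "base" > core.txt
git add core.txt
git commit -q -m "trunk: base"

git checkout -q -b feature
echo "one" > feature.txt
git add feature.txt
git commit -q -m "feature: part 1"
echo "two" >> feature.txt
git commit -q -a -m "feature: part 2"
before=$(git rev-parse feature)

# Trunk moves on while the feature is in development.
git checkout -q trunk
echo "trunk fix" >> core.txt
git commit -q -a -m "trunk: fix in a core file"

# Rebase the feature branch on the latest trunk (rewrites its commits).
git checkout -q feature
git rebase -q trunk
after=$(git rev-parse feature)

# Apply the rebased commit range to trunk.
git checkout -q trunk
git cherry-pick trunk..feature >/dev/null

echo "feature tip before rebase: $before"
echo "feature tip after rebase:  $after"
git --no-pager log --oneline --graph trunk
```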


Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-17 Thread Sangjin Lee
Thanks for the clarification, Andrew.

So is the proposal on the table squashing commits (on the feature branch)
when we rebase the feature branch with the latest from trunk? How would the
process work? A simple schematic example might be helpful in understanding
the proposal. If the feature branch was pushed to the remote repo, then
squashing commits (i.e. rewriting commits) could become tricky, right?
Thanks in advance.

Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-17 Thread Andrew Wang
Hi all,

I've thought about this topic more over the last week, and felt I should
play devil's advocate for a merge workflow. A few comments:

   - The issue of merges polluting history is mainly an issue when using
   a github PR workflow, which results in one merge per PR. Clearly this is
   not okay, but a separate issue from feature branches. We only have a
   handful of merge commits per feature branch.
   - The issue of changes hiding in merge commits can happen when resolving
   rebase conflicts too, except it's harder to track. Right now neither goes
   through code review, which is sketchy. We probably should review these too,
   and it's easier to review a single merge commit vs. an entire rebased
   branch. Merge is also a more natural way of integrating changes from trunk,
   since you just resolve all conflicts at once at the end.
   - Merge gives us a linear history on the branch but worse history on
   trunk/branch-2. Rebase has worse history on the branch but a linear history
   on trunk/branch-2. This means for quick/small feature branches that don't
   have a lot of conflicts, rebase is preferred. For large features with lots
   of conflicts, merge is preferred. This is basically what we're running into
   on HDFS-7285.
   - Rebase also comes with increased coordination costs, since public
   history is being rewritten. This is again okay for smaller efforts (where
   there are fewer contributors), but more painful with bigger ones. There
   have been a number of HDFS-7285 branches created basically as a result of
   rebase, with corresponding JIRA discussions about where to commit things.
   - The issue of a single squashed commit for the branch-2 backport is
   arguably an issue with how we structure our branches. If release branches
   forked off of trunk rather than branch-2, we wouldn't have this problem. We
   could require branch-2 integration to also happen via git merge. Or we kick
   trunk out to a feature branch based off of branch-2. Or we shrug and keep
   the status quo.

I'd definitely appreciate commentary from others who've worked on feature
branches in git, even in communities outside of Hadoop.

If there is support for allowing merge-based workflows in addition to
rebase, we'd need to kick off a [VOTE] thread since the last [VOTE] only
allows rebase.

Best,
Andrew

On Mon, Aug 17, 2015 at 11:33 AM, Andrew Wang andrew.w...@cloudera.com
wrote:

 @Sangjin,

 I believe this is covered by the [VOTE] I linked to above, key excerpt
 being:

3. Force-push on feature-branches is allowed. Before pulling in a
feature, the feature-branch should be rebased on latest trunk and the
changes applied to trunk through git rebase --onto or git cherry-pick
commit-range.

 This specifies that the last uprev final integration of the branch into trunk 
 happen with rebase. It doesn't say anything about the periodic uprev's, but 
 it'd be very strange to merge periodically and then rebase once at the end. 
 So I take it to mean doing periodic uprevs with rebase too.


 On Mon, Aug 17, 2015 at 11:23 AM, Sangjin Lee sj...@apache.org wrote:

 Just to be clear, are we discussing the process of uprev'ing the feature
 development branch with the latest from the trunk from time to time, or
 making the final merge of the feature branch onto the trunk?

 On Mon, Aug 17, 2015 at 10:21 AM, Steve Loughran ste...@hortonworks.com
 wrote:

  I haven't done a bit piece of work in the ASF code repo since the
  migration to git; though I have done it in the svn era.
 
 
  Currently with private git repos
  -anyone gets SCM control of their source
  -you can commit for your own reasons (about to make a change, want a
  private jenkins run, ...) and gain from having many small checkins. More
  succinctly: if you aren't checking in your work 2+ times a day, why not?
  -rebasing is a painful necessity on personal, private branches to keep the
  final patch to hadoop git a single diff
 
  With the private git process that's the de facto standard, we lose history
  anyway. I know what I've done and somewhere there's a tag in my own github
  repo of my work to create a JIRA. But we don't always need that entire
  history of trying to debug kerberos, typo in exception, and other stuff
  that accrues during the work.
 
  I think therefore that I'm in favour of big squash commits. What we could
  do is extend that with a policy of
 
 
1.  tag the final commit used to make the patch, something like
  tag_HADOOP-8192. The tag ensures that the history isn't gc'd
2.  Delete the branch (keeps the # of branches down)
3.  In the JIRA, include the name of the tag and the git commit number
  in the comments. Someone curious can rebuild that history
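
A minimal sketch of the three-step policy above, run in a throwaway repository. The branch name `work` and the tag name `tag_EXAMPLE-1` are made up for illustration (the tag simply follows the tag_HADOOP-8192 pattern mentioned above); nothing here is an agreed project convention.

```shell
set -eu
# Throwaway identity so the script does not depend on any global git config.
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.invalid
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk
echo "base" > file.txt
git add file.txt
git commit -q -m "base"

# Private work branch with a couple of small checkins.
git checkout -q -b work
echo "one" >> file.txt
git commit -q -am "wip 1"
echo "two" >> file.txt
git commit -q -am "wip 2"
final=$(git rev-parse HEAD)

# 1. Tag the final commit used to make the patch (hypothetical tag name).
git tag tag_EXAMPLE-1 "$final"

# 2. Delete the branch; the tag keeps its commits reachable.
git checkout -q trunk
git branch -q -D work

# 3. Anyone with the tag name (or the commit id) can still walk the history.
git log --oneline tag_EXAMPLE-1
```

Because the tag is an ordinary ref, the small checkins stay reachable after the branch is gone, although only in repositories the tag has actually been pushed to.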
 
 





Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-17 Thread Jing Zhao
I think we should allow merge-based workflows. I have worked, and am still
working, in several big feature branches, including HDFS-2802 (100 subtasks)
and HDFS-7285 (currently already  200 subtasks), and have tried both the
merge-based and rebase-based workflows. When the feature change becomes
big, rebase becomes a big pain, because a small change in trunk can cause
conflicts when rebasing a large number of commits in the feature branch.
Using git merge to merge trunk changes into the feature branch is much
easier in this case.
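
To make the difference concrete, here is a toy reproduction, not taken from HDFS-7285: a throwaway repository in which three feature commits and one small trunk commit all touch the same line. The file name, branch names and resolution text are invented for the sketch. It counts how many times each approach stops for conflict resolution.

```shell
set -eu
# Throwaway identity so the script does not depend on any global git config.
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.invalid
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk

# One "core" file that both trunk and the feature branch keep touching.
echo "v0" > core.txt
git add core.txt
git commit -q -m "base"

# Feature branch: three commits, each rewriting the same line.
git checkout -q -b feature
for i in 1 2 3; do
  echo "feature v$i" > core.txt
  git commit -q -am "feature $i"
done

# Trunk: a single small change to that same line.
git checkout -q trunk
echo "trunk change" > core.txt
git commit -q -am "trunk change"

# Merge-based uprev: trunk is merged into a copy of the feature branch.
git checkout -q -b feature-merge feature
merge_stops=0
if ! git merge --no-edit trunk >/dev/null 2>&1; then
  merge_stops=1
  echo "trunk change + feature v3" > core.txt   # resolve once, at the end
  git add core.txt
  git commit -q --no-edit
fi

# Rebase-based uprev: every feature commit is replayed onto trunk in turn.
git checkout -q -b feature-rebase feature
rebase_stops=0
if ! git rebase trunk >/dev/null 2>&1; then
  rebase_stops=1
  while [ "$rebase_stops" -le 10 ]; do          # bound the loop defensively
    echo "trunk change + feature v$rebase_stops" > core.txt
    git add core.txt
    if GIT_EDITOR=true git rebase --continue >/dev/null 2>&1; then
      break
    fi
    rebase_stops=$((rebase_stops + 1))
  done
fi

echo "merge stopped for conflicts $merge_stops time(s)"
echo "rebase stopped for conflicts $rebase_stops time(s)"
```

In this setup the merge needs a single resolution while the rebase stops once per feature commit, which is the effect described above, scaled down from hundreds of commits to three.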

Thanks,
-Jing

On Mon, Aug 17, 2015 at 12:17 PM, Andrew Wang andrew.w...@cloudera.com
wrote:

 Hi all,

 I've thought about this topic more over the last week, and felt I should
 play devil's advocate for a merge workflow. A few comments:

 - The issue of merges polluting history is mainly an issue when using a
   github PR workflow, which results in one merge per PR. Clearly this is
   not okay, but it is a separate issue from feature branches. We only have
   a handful of merge commits per feature branch.
 - The issue of changes hiding in merge commits can happen when resolving
   rebase conflicts too, except it's harder to track. Right now neither goes
   through code review, which is sketchy. We probably should review these
   too, and it's easier to review a single merge commit vs. an entire
   rebased branch. Merge is also a more natural way of integrating changes
   from trunk, since you just resolve all conflicts at once at the end.
 - Merge gives us a linear history on the branch but worse history on
   trunk/branch-2. Rebase has worse history on the branch but a linear
   history on trunk/branch-2. This means for quick/small feature branches
   that don't have a lot of conflicts, rebase is preferred. For large
   features with lots of conflicts, merge is preferred. This is basically
   what we're running into on HDFS-7285.
 - Rebase also comes with increased coordination costs, since public
   history is being rewritten. This is again okay for smaller efforts (where
   there are fewer contributors), but more painful with bigger ones. There
   have been a number of HDFS-7285 branches created basically as a result of
   rebase, with corresponding JIRA discussions about where to commit things.
 - The issue of a single squashed commit for the branch-2 backport is
   arguably an issue with how we structure our branches. If release branches
   forked off of trunk rather than branch-2, we wouldn't have this problem.
   We could require branch-2 integration to also happen via git merge. Or we
   kick trunk out to a feature branch based off of branch-2. Or we shrug and
   keep the status quo.

 I'd definitely appreciate commentary from others who've worked on feature
 branches in git, even in communities outside of Hadoop.

 If there is support for allowing merge-based workflows in addition to
 rebase, we'd need to kick off a [VOTE] thread since the last [VOTE] only
 allows rebase.

 Best,
 Andrew

 On Mon, Aug 17, 2015 at 11:33 AM, Andrew Wang andrew.w...@cloudera.com
 wrote:

  @Sangjin,
 
  I believe this is covered by the [VOTE] I linked to above, key excerpt
  being:
 
 3. Force-push on feature-branches is allowed. Before pulling in a
 feature, the feature-branch should be rebased on latest trunk and the
 changes applied to trunk through git rebase --onto or git cherry-pick
 commit-range.
 
  This specifies that the last uprev, i.e. the final integration of the
  branch into trunk, happens with rebase. It doesn't say anything about the
  periodic uprevs, but it'd be very strange to merge periodically and then
  rebase once at the end. So I take it to mean doing periodic uprevs with
  rebase too.
 
 
  On Mon, Aug 17, 2015 at 11:23 AM, Sangjin Lee sj...@apache.org wrote:
 
  Just to be clear, are we discussing the process of uprev'ing the feature
  development branch with the latest from the trunk from time to time, or
  making the final merge of the feature branch onto the trunk?
 
  On Mon, Aug 17, 2015 at 10:21 AM, Steve Loughran ste...@hortonworks.com
  wrote:
 
   I haven't done a big piece of work in the ASF code repo since the
   migration to git; though I have done it in the svn era.
  
  
   Currently with private git repos
   -anyone gets SCM control of their source
   -you can commit for your own reasons (about to make a change, want a
   private jenkins run, ...) and gain from having many small checkins. More
   succinctly: if you aren't checking in your work 2+ times a day, why not?
   -rebasing is a painful necessity on personal, private branches to keep the
   final patch to hadoop git a single diff
  
   With the private git process that's the de facto standard, we lose history
   anyway. I know what I've done and somewhere there's a tag in my own github
   repo of my work to create a JIRA. But we don't always need that entire
   history of trying to debug kerberos, typo in exception, and other stuff
   that accrues 

Re: [DISCUSS] git rebase vs. git merge for branch development

2015-08-15 Thread Karthik Kambatla
I prefer Proposal #1 as well. Squashing some of the commits seems a major
improvement over our previous model of a single commit for the entire
branch.

On Tue, Aug 11, 2015 at 2:19 PM, Andrew Wang andrew.w...@cloudera.com
wrote:

 Hi all,

 We are currently working on a pretty substantial new feature in a branch
 over at HDFS-7285. As the # of commits has grown, running `git rebase` and
 fixing conflicts in the 180+ commits has become untenable. As you may
 recall, we voted to use a rebase workflow when we did the switch from SVN
 to git a year ago [1].

 I'm aware of two proposals right now:

 

 Proposal 1: Squash some of the commits to make rebase easier.

 Oftentimes, intermediate commits are made to code that gets changed again
 later, and thus don't end up in HEAD. Fixing conflicts in these
 intermediate commits is a waste of time, especially with 180 commits. I run
 into this issue even with my local feature branches, and thus squash.

 The downside is that squashing loses some of the development history, since
 now multiple JIRAs are combined into a single commit. There are some ways
 to mitigate this: the old branch with the full history can be left in
 place, and the squashed commits can reference the JIRAs that have been
 squashed together.

 

 Proposal 2: Allow merge-based workflows too.

 This is what we were doing in the SVN days. Periodically merge trunk to the
 branch, resulting in merge commits to resolve conflicts. When the branch is
 ready, merge it back to trunk.

 I read through the discussion thread [2] where we decided to go with
 rebase. The concerns were that merge commits pollute history, which was an
 issue for HBase and I believe Spark. Merge commits are not associated with
 a single JIRA or commit, and fixes are sometimes hidden in merge commits.
 This makes backports harder.

 Merge-based workflows also squash the history when backporting to a branch.
 In the SVN merge-based days, backporting to branch-2 was typically done as
 a single squashed commit. With a rebase workflow, it's possible to rebase
 the branch against branch-2 and get the same history as trunk.

 

 My mild preference is for Proposal #1 since it results in a clean linear
 history in both trunk and branch-2, but it has to be understood that
 squashing is sometimes a required part of a rebase workflow. If the core
 issue with squashing is maintaining development history, I think it's
 satisfied by keeping old branches around and referencing the squashed
 JIRAs.

 Welcome other thoughts here too.

 Best,
 Andrew

 [1]:

 http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201408.mbox/%3CCALwhT94Y64M9keY25Ry_QOLUSZQT29tJQ95twsoa8xXrcNTxpQ%40mail.gmail.com%3E

 [2]:

 http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201408.mbox/%3CCALwhT97bM36X6-3%3DcCUwaAKxZ80jfZwuf53BTR7TbWwV5e%2BXkA%40mail.gmail.com%3E



[DISCUSS] git rebase vs. git merge for branch development

2015-08-11 Thread Andrew Wang
Hi all,

We are currently working on a pretty substantial new feature in a branch
over at HDFS-7285. As the # of commits has grown, running `git rebase` and
fixing conflicts in the 180+ commits has become untenable. As you may
recall, we voted to use a rebase workflow when we did the switch from SVN
to git a year ago [1].

I'm aware of two proposals right now:



Proposal 1: Squash some of the commits to make rebase easier.

Oftentimes, intermediate commits are made to code that gets changed again
later, and thus don't end up in HEAD. Fixing conflicts in these
intermediate commits is a waste of time, especially with 180 commits. I run
into this issue even with my local feature branches, and thus squash.

The downside is that squashing loses some of the development history, since
now multiple JIRAs are combined into a single commit. There are some ways
to mitigate this: the old branch with the full history can be left in
place, and the squashed commits can reference the JIRAs that have been
squashed together.
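
One possible way to carry out Proposal 1, sketched in a throwaway repository. The JIRA IDs (JIRA-1 to JIRA-3), the file names and the `feature-full-history` branch name are placeholders, and `git reset --soft` is only one of several ways to squash (an interactive rebase would work as well).

```shell
set -eu
# Throwaway identity so the script does not depend on any global git config.
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.invalid
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk
echo "base" > core.txt
git add core.txt
git commit -q -m "base"

# Feature branch with three per-JIRA commits (made-up IDs).
git checkout -q -b feature
for i in 1 2 3; do
  echo "feature v$i" > feature.txt
  git add feature.txt
  git commit -q -m "JIRA-$i: intermediate change $i"
done

# Trunk moves on in the meantime.
git checkout -q trunk
echo "trunk change" >> core.txt
git commit -q -am "trunk change"

# Leave the old branch with the full history in place.
git branch feature-full-history feature

# Squash: move the branch back to the fork point while keeping the final
# tree staged, then record it as one commit that names the squashed JIRAs.
git checkout -q feature
git reset -q --soft "$(git merge-base trunk feature)"
git commit -q -m "Squash of JIRA-1, JIRA-2, JIRA-3"

# Only one commit now has to be replayed onto trunk.
git rebase -q trunk

echo "commits to replay after squashing:"
git log --oneline trunk..feature
echo "full history kept on the old branch:"
git log --oneline trunk..feature-full-history
```

The squashed branch ends with the same `feature.txt` contents as the original, and the intermediate states survive only on the preserved branch.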



Proposal 2: Allow merge-based workflows too.

This is what we were doing in the SVN days. Periodically merge trunk to the
branch, resulting in merge commits to resolve conflicts. When the branch is
ready, merge it back to trunk.
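
A small sketch of that merge-based cycle in a throwaway repository. File and branch names are invented, and `--no-ff` on the final merge is a choice made here so the integration point shows up as its own commit, not something this proposal prescribes.

```shell
set -eu
# Throwaway identity so the script does not depend on any global git config.
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.invalid
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk
echo "base" > base.txt
git add base.txt
git commit -q -m "base"

# Work starts on the feature branch.
git checkout -q -b feature
echo "f1" > feature.txt
git add feature.txt
git commit -q -m "feature work 1"

# Trunk moves on independently.
git checkout -q trunk
echo "t1" > trunk.txt
git add trunk.txt
git commit -q -m "trunk work 1"

# Periodic uprev: merge trunk into the branch (a merge commit is created
# because the two histories have diverged).
git checkout -q feature
git merge -q --no-edit trunk

echo "f2" >> feature.txt
git commit -q -am "feature work 2"

# Branch is ready: merge it back to trunk with an explicit merge commit.
git checkout -q trunk
git merge -q --no-ff --no-edit feature

git log --oneline --graph trunk
```

The resulting trunk history contains both merge commits, which is the "pollution" concern raised in the earlier thread, but no commit on the branch was ever rewritten.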

I read through the discussion thread [2] where we decided to go with
rebase. The concerns were that merge commits pollute history, which was an
issue for HBase and I believe Spark. Merge commits are not associated with
a single JIRA or commit, and fixes are sometimes hidden in merge commits.
This makes backports harder.

Merge-based workflows also squash the history when backporting to a branch.
In the SVN merge-based days, backporting to branch-2 was typically done as
a single squashed commit. With a rebase workflow, it's possible to rebase
the branch against branch-2 and get the same history as trunk.
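
As an illustration of the last point, here is a sketch of replaying a feature branch onto branch-2 commit by commit, using either `git rebase --onto` or a `git cherry-pick` commit range. The repository layout and the `backport-rebase` / `backport-pick` branch names are invented, and real backports would of course hit conflicts that this toy case avoids.

```shell
set -eu
# Throwaway identity so the script does not depend on any global git config.
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.invalid
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/trunk
echo "base" > base.txt
git add base.txt
git commit -q -m "base"

# The release line forks here.
git branch branch-2

# A change that lives only on trunk and must not be backported.
echo "trunk only" > trunk-only.txt
git add trunk-only.txt
git commit -q -m "trunk-only change"

# Feature branch (already rebased on trunk) with two commits.
git checkout -q -b feature
for i in 1 2; do
  echo "feature v$i" > feature.txt
  git add feature.txt
  git commit -q -m "feature $i"
done

# Option A: replay the commits in trunk..feature onto branch-2.
git checkout -q -b backport-rebase feature
git rebase -q --onto branch-2 trunk backport-rebase

# Option B: cherry-pick the same commit range onto a branch-2 based branch.
git checkout -q -b backport-pick branch-2
git cherry-pick trunk..feature >/dev/null

echo "via rebase --onto:"
git log --format=%s branch-2..backport-rebase
echo "via cherry-pick range:"
git log --format=%s branch-2..backport-pick
```

Both options keep one commit per original feature commit on top of branch-2 and leave the trunk-only change behind, in contrast to the single squashed commit of the merge-based backport.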



My mild preference is for Proposal #1 since it results in a clean linear
history in both trunk and branch-2, but it has to be understood that
squashing is sometimes a required part of a rebase workflow. If the core
issue with squashing is maintaining development history, I think it's
satisfied by keeping old branches around and referencing the squashed JIRAs.

Welcome other thoughts here too.

Best,
Andrew

[1]:
http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201408.mbox/%3CCALwhT94Y64M9keY25Ry_QOLUSZQT29tJQ95twsoa8xXrcNTxpQ%40mail.gmail.com%3E

[2]:
http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201408.mbox/%3CCALwhT97bM36X6-3%3DcCUwaAKxZ80jfZwuf53BTR7TbWwV5e%2BXkA%40mail.gmail.com%3E