Alan,

I agree with all your statements, with the exception of one.

"Second, the way Apache works is that contributors scratch the itch that
bothers them. So to argue "We shouldn't do X because we never finished
Y" or "We shouldn't do X because we're doing Y" (where X and Y are
independent) is not valid in Apache projects."

I disagree. Look at this:

https://issues.apache.org/jira/browse/HIVE-3585

A contribution was immediately met with a -1.

I personally have had issues closed as "WONT FIX" or "LATER" across a
variety of Apache projects because the committers decided the feature was
out of scope, or whatever.

Arguing that if one contributor wants to "scratch an itch" we should allow
it in the project is not practical, because we have to be able to maintain
Hive after the "itch scratcher" finds a new itch and moves on. Hive is not
a hosting service for "every cool idea".

This was why I mentioned things like "windows support". I do not think
there was ever a point where the committers/PMC agreed that "windows
support" was something we all wanted to work towards. I cannot pin down
how the initiative started or why. Now whoever started that ball rolling
has moved on. I do not own a Windows computer, and we have no Apache
infrastructure to test Hive on Windows. Jira issues stay open, and those of
us in it for the long haul end up holding the ball, supporting things we
never explicitly wanted.

As this relates to Tez: Tez is in the incubator. Hive is release-quality
software. I am not convinced Tez is the direction we should go in. I am
scared of it going the path of "windows support" or "oracle support",
because someone is "scratching an itch" and we (the committers) do not have
enough information about the changes involved, the timeline, or what types
of use cases will benefit from this feature.

Tez refactorings are getting filed as 'MAJOR' 'BUGS' and getting committed
to trunk, when they are 'IMPROVEMENTS' that are 'LOW' priority. I do not
understand why there is such a priority to merge code into trunk, when we
can all see this branch is going to be open for a long time and be rather
involved. Even then I would not mind, if it were not largely unfair to
everyone else who now needs to rebase.

On Tue, Jul 16, 2013 at 2:24 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Ed,
>
> I'm not sure I understand your argument, so I'm going to try to restate
> it.  Please tell me if I understand it correctly.
>
> I think you're saying we should not embark on big projects in Hive because:
> 1) There were big projects in the past that were abandoned or are not
> currently making progress (such as Oracle integration, Hive StorageHandler)
> 2) There are other big projects going on (ORC, Vectorization)
> 3) There are lots of outstanding patches that need to be dealt with.
>
> I would respond with two points to this.
>
> First, I agree that the large outstanding patch count is very bad.  It
> keeps people from getting involved in Hive.  It deprives Hive of fixes and
> improvements it would otherwise have.  Several of the committers are
> working to address this by checking in peoples' patches, but they are
> unable to keep up.  The best solution is to encourage other committers to
> check in patches as well and to find willing and able contributors and
> mentor them to committership as quickly as possible.
>
> Second, the way Apache works is that contributors scratch the itch that
> bothers them. So to argue "We shouldn't do X because we never finished Y"
> or "We shouldn't do X because we're doing Y" (where X and Y are
> independent) is not valid in Apache projects.  It's fine to argue that Tez
> hasn't been adequately explained (I think you hinted at this in previous
> emails) and ask for clarifications on what it is and what the planned
> changes are.  If after a full explanation you think it's a bad idea it's
> fine to argue Tez is the wrong direction for Hive and try to convince the
> rest of the community.  But assuming the community accepts that Tez is a
> reasonable direction and there are volunteers who want to do the work, then
> you can't argue they should work on something else instead.
>
> Alan.
>
> On Jul 15, 2013, at 6:51 PM, Edward Capriolo wrote:
>
> > >> The Hive bylaws,
> > >> https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what
> > >> votes are needed for what.  I don't see anything there about needing 3
> > >> +1s for a branch.  Branching would seem to fall under code change, which
> > >> requires one vote and a minimum length of 1 day.
> >
> > You could argue that all you need is one +1 to create a branch, but this
> > is more than a branch. If you are talking about something that is:
> > 1) going to cause major re-factoring of critical pieces of hive like
> > ExecDriver and MapRedTask
> > 2) going to be very disruptive to the efforts of other committers
> > 3) something that may be a major architectural change
> >
> > Getting the project on board with the idea is a good idea.
> >
> > Now I want to point something out. Here are some recent initiatives in
> > hive:
> >
> > 1) At one point there was a big initiative to "support oracle". After the
> > initial work, there are patches in Jira, and no one seems to care about
> > oracle support.
> > 2) Another such decision was this "support windows" one. There are
> > probably 4 windows patches waiting for review.
> > 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
> > support perspective is, but every couple of weeks we get another jira
> > about something not working/testing on one of those versions. It seems
> > like several builds are broken.
> > 4) Hive storage handler: after the initial implementation no one cares to
> > review any other storage handler implementation. There are 3 or more
> > patches there, and I could not even find anyone willing to review the
> > cassandra storage handler I spent months on.
> > 5) ORC, Vectorization
> > 6) Windowing: committed, numerous check-style violations.
> >
> > We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
> > are spread very thin, and embarking on another side project not involved
> > with core hive seems like the wrong direction at the moment.
> >
> >
> > On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <ga...@hortonworks.com> wrote:
> >
> >>
> >> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> >>
> >>> I have started to see several refactoring patches around tez.
> >>> https://issues.apache.org/jira/browse/HIVE-4843
> >>>
> >>> This is the only mention on the hive list I can find with tez:
> >>> "Makes sense. I will create the branch soon.
> >>>
> >>> Thanks,
> >>> Ashutosh
> >>>
> >>>
> >>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
> >>> ghagleit...@hortonworks.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I am starting to work on integrating Tez into Hive (see HIVE-4660,
> >>>> design doc has already been uploaded - any feedback will be much
> >>>> appreciated). This will be a fair amount of work that will take time to
> >>>> stabilize/test. I'd like to propose creating a branch in order to be
> >>>> able to do this incrementally and collaboratively. In order to progress
> >>>> rapidly with this, I would also like to go "commit-then-review".
> >>>>
> >>>> Thanks,
> >>>> Gunther.
> >>>> "
> >>>
> >>> These refactorings are largely destructive to a number of bug fixes and
> >>> language improvements in hive. These language improvements and bug fixes
> >>> have been sitting in Jira for quite some time now, marked patch-available
> >>> and waiting for review.
> >>>
> >>> There are a few things I want to point out:
> >>> 1) Normally we create design docs in our wiki (which it is not)
> >>> 2) Normally when the change is significantly complex we get multiple
> >>> committers to comment on it (which we did not)
> >>> On point 2, no one -1'd the branch, but this is really something that
> >>> should have required a +1 from 3 committers.
> >>
> >> The Hive bylaws,
> >> https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what
> >> votes are needed for what.  I don't see anything there about needing 3
> >> +1s for a branch.  Branching would seem to fall under code change, which
> >> requires one vote and a minimum length of 1 day.
> >>
> >>>
> >>> I for one am not completely sold on Tez.
> >>> http://incubator.apache.org/projects/tez.html.
> >>> "directed-acyclic-graph of tasks for processing data": this description
> >>> sounds like many things which have never become popular. One to think of
> >>> is oozie: "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
> >>> actions.". I am sure I can find a number of libraries/frameworks that
> >>> make this same claim. In general I do not feel like we have done our
> >>> homework and pre-requisites to justify all this work. If we have done
> >>> the homework, I am sure that it has not been communicated and accepted
> >>> by hive developers at large.
> >>
> >> A request for better documentation on Tez and a project road map seems
> >> totally reasonable.
> >>
> >>>
> >>> If we have a branch, why are we also committing on trunk? Scanning
> >>> through the tez doc, I keep finding language like "minimal changes to
> >>> the planner", yet there are ALREADY lots of large changes going on!
> >>>
> >>> Really none of the above would bother me except for the fact that these
> >>> "minimal changes" are causing many "patch available" ready-for-review
> >>> bugs and core hive features to need to be rebased.
> >>>
> >>> I am sure I have mentioned this before, but I have to spend 12+ hours
> >>> to test a single patch on my laptop. A few days ago I was testing a new
> >>> core hive feature. After all the tests passed and before I was able to
> >>> commit, someone unleashed a tez patch on trunk which caused the thing I
> >>> was testing for 12 hours to need to be rebased.
> >>>
> >>>
> >>> I'm not cool with this. Next time that happens to me I will seriously
> >>> consider reverting the patch. Bug fixes and new hive features are more
> >>> important to me than integrating with incubator projects.
> >>
> >> (With my Apache member hat on)  Reverting patches that aren't breaking
> >> the build is considered very bad form in Apache.  It does make sense to
> >> request that when people are going to commit a patch that will break many
> >> other patches they first give a few hours of notice so people can say
> >> something if they're about to commit another patch and avoid your fate of
> >> needing to rerun the tests.  The other thing is we need to get the
> >> automated build of patches working on Hive so committers are forced to
> >> run all of the tests themselves.  We are working on it, but we're not
> >> there yet.
> >>
> >> Alan.
> >>
> >>
>
>
