Re: [DISCUSS] Separating out the metastore as its own TLP

Alan Gates Mon, 10 Jul 2017 08:43:25 -0700

+1 to having an always releasable head of master.

+1 to having test verified API compliance.  I was thinking that the project
should set up verification tests where it runs against supported versions
of Hive, Impala, Spark, … (and obviously open to others to add their tests
as well) on a nightly basis so that we guarantee API stability.


Alan.

On Thu, Jul 6, 2017 at 2:10 AM, Peter Vary <pv...@cloudera.com> wrote:

> Hi folks,
>
> I agree with most of the things Edward said. I faced similar issues on a
> smaller scale when I integrated Hive with Yetus. We are forced to keep
> patched Yetus files in the Hive repo until they push their next release. I
> also followed a more serious problem, when a patch was committed to Hive,
> Impala and Spark, and just a few days before the release all of the changes
> were reverted from the projects due to concerns raised by the Spark committee
> (after the changes were already committed to Spark as well).
>
> Having said all of this, I still think that separating the HMS into a new
> top level project could be a step in the right direction with the following
> constraints. The new project should have:
> - A strict, stability oriented branching strategy following Edward's
> suggestions, so that if a downstream project (for example Hive) needs some
> fix or easy change, it can be incorporated and released almost immediately.
> So we have to have these:
>         - Always releasable head
>         - Every multi commit feature should be added as a feature branch
> - A strict, enforced, stability oriented API strategy, so that we are not
> surprised by features added by other projects that break Hive compatibility.
> To avoid this situation we need to design for it, have pre-commit tests in
> place to catch inadvertent changes, and most importantly have a clear
> commitment to it.
>
> I think, since the current HMS is already used by numerous other projects,
> we should already have these in mind when modifying anything in HMS related
> code. This is not the main focus of Hive, so we do not concentrate on it,
> and there are often interoperability problems. We could do this inside Hive
> as well, but the approach currently followed by Hive and the one required by
> the HMS call for different mindsets. We need a clear, well defined boundary,
> and separating the 2 projects could help with this. We can focus on the
> different needs and goals, and eventually we might have a different culture
> as well, one which suits the specific needs of the specific part of the code.
>
> I think by keeping these rules in the new top level HMS we can mitigate most
> of the issues mentioned below, and we will be better off overall.
> What do you think, Edward?
>
> Thanks,
> Peter
>
>
> > On Jul 5, 2017, at 10:16 PM, Xuefu Zhang <xu...@apache.org> wrote:
> >
> > I think Edward's concern is valid. While I voiced my support for this
> > proposal, which was more for the benefit of the whole Hadoop ecosystem, I
> > don't see equal benefits for Hive. Instead, it may even create more
> > overhead for Hive. I'd really like to take time to see what the roadblocks
> > are for other projects to use HMS as it is. The issue of Spark including
> > a Hive fork, which was brought up some time back, is certainly not one of
> > them.
> >
> > Thanks,
> > Xuefu
> >
> > On Wed, Jul 5, 2017 at 12:33 PM, Edward Capriolo <edlinuxg...@gmail.com>
> > wrote:
> >
> >> On Wed, Jul 5, 2017 at 1:51 PM, Alan Gates <alanfga...@gmail.com> wrote:
> >>
> >>> On Mon, Jul 3, 2017 at 6:20 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >>>
> >>>>
> >>>> We already have things in the metastore not directly tied to language
> >>>> features. For example, the Hive metastore has a "retention" property which
> >>>> is not actively in use by anything. In reality, we rarely say 'no' or -1 to
> >>>> much, which in part is why I believe our release process is grinding
> >>>> slower: we have so many things in flight that I do not feel any one person
> >>>> can keep track. You are working on porting the metastore to HBase
> >>>> (https://issues.apache.org/jira/browse/HIVE-9452). Did you get a -1 or 'No'
> >>>> along the way? When I first noticed this I pointed out that someone has
> >>>> already ported the metastore to Cassandra
> >>>> https://github.com/riptano/brisk/blob/master/src/java/src/org/apache/cassandra/hadoop/hive/metastore/SchemaManagerService.java,
> >>>> but there was more excitement/rationale for this multi-year approach using
> >>>> HBase, so I let everyone 'have at it'.
> >>>>
> >>> Your example and mine are not equivalent.  The HBase metastore is still a
> >>> Hive feature, even if some thought it not worthwhile.  That is different
> >>> than people bringing features that will never interest Hive or that Hive
> >>> could never use (e.g. Dain’s desire for the metastore to support Presto
> >>> style views).
> >>>
> >>> I forgot to mention the issue these would-be non-Hive contributors have
> >>> with releases if they contribute their features to the metastore while it’s
> >>> inside Hive.  Is Hive going to do a release just to push out features in
> >>> the metastore that it doesn’t care about?
> >>>
> >>> You seem to be asserting that doing this doesn’t really help non-Hive based
> >>> systems that are using or would like to use the metastore.  But it is
> >>> interesting that people from three of those systems have commented in the
> >>> thread so far, and all are positive (Dmitrias from Impala, Dain from
> >>> Presto, and Sriharsha from the schema registry project).
> >>>
> >>>
> >>>> I am going to give a hypothetical but real world situation. Suppose I want
> >>>> to add the statement "CREATE permanent macro xyz". This feature, I believe,
> >>>> would cross cut calcite, hive, and hive metastore. To build this feature I
> >>>> would need to orchestrate the change across 3 separate groups of hive
> >>>> 'subcommittees', for lack of a better word: 3 git repos, 3 JIRAs, 3
> >>>> releases. That is not counting if we run into some bug or misfeature (maybe
> >>>> with Tez or something else), so that brings in 4-5 releases of upstream to
> >>>> add a feature to hive. This does not take into account normal process
> >>>> mess-ups. For example, say you get the metastore done, but now the people
> >>>> doing the calcite/antlr suggest the feature have different syntax because
> >>>> they did not read the 3-4 linked tickets when the process started? Now you
> >>>> have to loop back around the process. Finding 1 person in 1 project to
> >>>> usher along the feature you want is difficult; having to find and clear
> >>>> time with 3 people across three projects is going to be difficult, along
> >>>> with then 'pushing' them all to kick out a release so you can finally use
> >>>> said feature.
> >>>>
> >>>
> >>> I partially agree with you.  On the reviews, JIRAs, etc. I don’t think it
> >>> adds much, if any, overhead.  Hive is a big project and no one person knows
> >>> all the code anymore.  If you wanted to add a permanent macros feature you
> >>> would need reviews from someone who knows the parser (probably Pengcheng),
> >>> people who know the optimizer (Jesus, Ashutosh, …), and someone who knows
> >>> the metastore (me, Thejas, …).  And any large feature is going to be
> >>> implemented over multiple JIRAs, all of which are linkable regardless of
> >>> whether the JIRAs start with METASTORE- or HIVE-.   I also don’t think it
> >>> makes the feature disagreement any worse.  If the optimizer team absolutely
> >>> insists it has to have some feature and the metastore team insists that it
> >>> can’t have that feature you’re going to have to work through the issue
> >>> whether they all are in Hive or in two separate projects.
> >>>
> >>> Where I agree the split adds cost is releases.  Before your macro feature
> >>> could go live you need releases from each of the components.  And while in
> >>> development the components need to use snapshot versions of the other
> >>> components.  My assertion is that the benefits outweigh this cost.
> >>>
> >>> Alan.
> >>>
> >>
> >>
> >> "You seem to be asserting that doing this doesn’t really help non-Hive based
> >> systems that are using or would like to use the metastore.  But it is
> >> interesting that people from three of those systems have commented in the
> >> thread so far, and all are positive (Dmitrias from Impala, Dain from
> >> Presto, and Sriharsha from the schema registry project)."
> >>
> >> I notice that impala has a syntax for caching.
> >>
> >> https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html
> >>
> >> Notice how the cache syntax did not make its way into Hive? It would make
> >> sense if this feature trickled its way into Hive and used HDFS caching, for
> >> example.
> >> I have heard many people claim that using the Hive metastore is a problem
> >> because it is packaged weird (like with ORC), but again, besides
> >> claiming/complaining, no one has stepped up to deal with that.
> >>
> >> What I would suggest is going forward for maybe a trial period of 6 months,
> >> labeling JIRA tickets with a tag that would be
> >> "SeeThisProvesWeNeedATLPMetastore". Because right now I do not see enough
> >> active use cases of people giving anything back to justify hurting our
> >> workflow so much.
> >>
> >>
> >> "I partially agree with you.  On the reviews, JIRAs, etc. I don’t think it
> >> adds much, if any, overhead.  Hive is a big project and no one person knows
> >> all the code anymore.  If you wanted to add a permanent macros feature you
> >> would need reviews from someone who knows the parser (probably Pengcheng),
> >> people who know the optimizer (Jesus, Ashutosh, …), and someone who knows
> >> the metastore (me, Thejas, …).  And any large feature is going to be
> >> implemented over multiple JIRAs, all of which are linkable regardless of
> >> whether the JIRAs start with METASTORE- or HIVE-.   I also don’t think it
> >> makes the feature disagreement any worse.  If the optimizer team absolutely
> >> insists it has to have some feature and the metastore team insists that it
> >> can’t have that feature you’re going to have to work through the issue
> >> whether they all are in Hive or in two separate projects"
> >>
> >> Macro was done in 1 patch and reviewed by 2 people. With 2-3 follow on
> >> bugs.
> >>
> >> https://issues.apache.org/jira/browse/HIVE-2655
> >>
> >> I think your perception is different than mine because of circumstances. I
> >> have waited weeks/months for reviews/merges (in Hive and other apache
> >> projects) from mundane udfs to cassandra-storage-handlers. You obviously
> >> work in a large company and you can more easily align objectives, go to the
> >> water cooler and say "hey bob you know it would be cool if you can release
> >> x so I can do y". When you are not in that situation it's like, "hey mailing
> >> list, my patch was done for three months now and like I have had to rebase
> >> it three times and like I notice like other stuff is getting committed."
> >>
> >> If you look at it tactically, "create permanent macro xyz". I go over to
> >> calcite and suggest some changes there; if this concept is not a "game
> >> changer" it is probably going to sit unreviewed. If it is "game changer"
> >> exciting, that is 72 hours for release voting. Next, go to hive-metastore and
> >> repeat the process, but remember now I have to "wow" the metastore people
> >> with the "game changer", and if that crew is super focused on something
> >> about kafka, well, now Hive features are second fiddle. Now let's say a hive
> >> release is coming up, and I really want my feature in it.
> >> hive-metastore-tlp might currently have a broken trunk because the "mongo
> >> wants to add spaceships to wombats" feature has a bug, and frankly that
> >> should not affect us.
> >>
> >> I hate to draw in something else but I feel it is related:
> >>
> >> 8 December 2016 : release 2.1.1 available
> >> 07 April 2017 : release 1.2.2 available
> >> hive-dev [DISCUSS] Supporting Hadoop-1 and experimental features
> >> hive-dev Re: release chaos?
> >>
> >> I have been vocal about not liking certain branching strategies and
> >> proposals that take us away from releasable trunk. We have steadily headed
> >> in a direction where we are pulling things out of hive, and we are not able
> >> to turn out releases. We even had a thread "release chaos" talking about
> >> our 5 active branches (with friends I say "jumped the shark"). Pulling out
> >> the metastore is only going to make this worse. I do not even see the model
> >> as successful. You may say it is great that calcite lets people share our
> >> sql dialect or the ORC TLP has 5 committers, but if Hive can not get a
> >> release out the door I do not see us optimizing for the proper thing.
> >>
>
>
