Re: [DISCUSS] Separating out the metastore as its own TLP

Peter Vary Thu, 06 Jul 2017 02:11:34 -0700

Hi folks,

I agree with most of what Edward said. I have faced similar issues on a 
smaller scale when integrating Hive with Yetus: we are forced to keep patched 
Yetus files in the Hive repo until they push their next release. I also 
followed a more serious problem, where a patch was committed to Hive, Impala 
and Spark, and just a few days before the release it was reverted from all of 
the projects due to concerns raised by the Spark committee (after the changes 
had already been committed to Spark as well).


Having said all of this, I still think that separating the HMS into a new top 
level project could be a step in the right direction, with the following 
constraints. The new project should have:
- A strict, stability-oriented branching strategy following Edward's 
suggestions, so that if a downstream project - for example Hive - needs some 
fix or easy change, it can be incorporated and released almost immediately. So 
we have to have these:
        - Always releasable head
        - Every multi commit feature should be added as a feature branch
- A strict, enforced, stability-oriented API strategy, so that we will not be 
surprised by features added by other projects that break Hive compatibility. To 
avoid this situation we need to design for it, have pre-commit tests in place 
to catch inadvertent changes, and most importantly have a clear commitment 
to it.

I think, since the current HMS is already used by numerous other projects, we 
should already have these in mind when modifying anything in HMS related code. 
This is not the main focus of Hive, so we do not concentrate on it, and there 
are often interoperability problems. We can do this inside Hive as 
well, but the current approach followed by Hive and the one required by the 
HMS call for different mindsets. We need a clear, well defined boundary, 
and separating the 2 projects could help with this. We can focus on the 
different needs and goals, and eventually we might have different cultures as 
well, each suited to the specific needs of its part of the code.

I think by keeping these rules in the new top level HMS we can mitigate most 
of the issues mentioned below, and we will be better off overall.
What do you think, Edward?

Thanks,
Peter

  
> On Jul 5, 2017, at 10:16 PM, Xuefu Zhang <[email protected]> wrote:
> 
> I think Edward's concern is valid. While I voiced my support for this
> proposal, which was more from the benefits of the whole Hadoop ecosystem, I
> don't see the equal benefits for Hive. Instead, it may even create more
> overhead for Hive. I'd really like to take time to see what are the road
> blocks for other projects to use HMS as it is. The issue of Spark including
> a Hive fork, which was brought up some time back, is certainly not one of
> them.
> 
> Thanks,
> Xuefu
> 
> On Wed, Jul 5, 2017 at 12:33 PM, Edward Capriolo <[email protected]>
> wrote:
> 
>> On Wed, Jul 5, 2017 at 1:51 PM, Alan Gates <[email protected]> wrote:
>> 
>>> On Mon, Jul 3, 2017 at 6:20 AM, Edward Capriolo <[email protected]>
>>> wrote:
>>> 
>>>> 
>>>> We already have things in the meta-store not directly tied to language
>>>> features. For example hive metastore has a "retention" property which
>> is
>>>> not actively in use by anything. In reality, we rarely say 'no' or -1
>> to
>>>> much. Which in part is why I believe our release process is grinding
>>>> slower: we have so many things in flight I do not feel that any one
>>> person
>>>> can keep track. You are working on porting the metastore to hbase.
>>>> https://issues.apache.org/jira/browse/HIVE-9452 did you get a -1 or
>> 'No'
>>>> along the way? When I first noticed this I pointed out that someone has
>>>> already ported the metastore to Cassandra
>>>> https://github.com/riptano/brisk/blob/master/src/java/src/org/apache/cassandra/hadoop/hive/metastore/SchemaManagerService.java,
>>>> but I was more exciting/rational for this multi-year approach using
>> hbase
>>>> so I let everyone 'have at it'.
>>>> 
>>> Your example and mine are not equivalent.  The HBase metastore is still a
>>> Hive feature, even if some thought it not worth while.  That is different
>>> than people bringing features that will never interest Hive or that Hive
>>> could never use (e.g. Dain’s desire for the metastore to support Presto
>>> style views).
>>> 
>>> I forgot to mention the issue these would be non-Hive contributors have
>>> with releases if they contribute their features to the metastore while
>> it’s
>>> inside Hive.  Is Hive going to do a release just to push out features in
>>> the metastore that it doesn’t care about?
>>> 
>>> You seem to be asserting that doing this doesn’t really help non-Hive
>> based
>>> systems that are using or would like to use the metastore.  But it is
>>> interesting that people from three of those systems have commented in the
>>> thread so far, and all are positive (Dmitrias from Impala, Dain from
>>> Presto, and Sriharsha from the schema registry project).
>>> 
>>> 
>>>> I am going to give a hypothetical but real world situation. Suppose I
>>> want
>>>> to add the statement "CREATE permanent macro xyz", this feature I
>> believe
>>>> would cross cut calcite, hive, and hive metastore. To build this
>> feature
>>> I
>>>> would need to orchestrate the change across 3 separate groups of hive
>>>> 'subcommittees' for lack of a better word. 3 git repos, 3 Jira's 3
>>>> releases. That is not counting if we run into some bug or misfeature
>>> (maybe
>>>> with Tez or something else) so that brings in 4-5 releases of upstream
>> to
>>>> add a feature to hive. This does not take into account normal processes
>>>> mess ups. For example say you get the metastore done, but now the
>> people
>>>> doing the calcite/antlr suggest the feature have different syntax
>> because
>>>> they did not read the 3-4 linked tickets when the process started? Now,
>>> you
>>>> have to loop back around the process. Finding 1 person in 1 project to
>>>> usher along the feature you want is difficult, having to find and clear
>>>> time with 3 people across three projects is going to be a difficult
>> along
>>>> with then 'pushing' them all to kick out a release so you can finally
>> use
>>>> said feature.
>>>> 
>>> 
>>> I partially agree with you.  On the reviews, JIRAs, etc. I don’t think it
>>> adds much, if any, overhead.  Hive is a big project and no one person
>> knows
>>> all the code anymore.  If you wanted to add a permanent macros feature
>> you
>>> would need reviews from someone who knows the parser (probably
>> Pengcheng),
>>> people who know the optimizer (Jesus, Ashutosh, …), and someone who knows
>>> the metastore (me, Thejas, …).  And any large feature is going to be
>>> implemented over multiple JIRAs, all of which are linkable regardless of
>>> whether the JIRAs start with METASTORE- or HIVE-.   I also don’t think it
>>> makes the feature disagreement any worse.  If the optimizer team
>> absolutely
>>> insists it has to have some feature and the metastore team insists that
>> it
>>> can’t have that feature you’re going to have to work through the issue
>>> whether they all are in Hive or in two separate projects.
>>> 
>>> Where I agree the split adds cost is releases.  Before your macro feature
>>> could go live you need releases from each of the components.  And while
>> in
>>> development the components need to use snapshot versions of the other
>>> components.  My assertion is that the benefits out weigh this cost.
>>> 
>>> Alan.
>>> 
>> 
>> 
>> "You seem to be asserting that doing this doesn’t really help non-Hive
>> based
>> systems that are using or would like to use the metastore.  But it is
>> interesting that people from three of those systems have commented in the
>> thread so far, and all are positive (Dmitrias from Impala, Dain from
>> Presto, and Sriharsha from the schema registry project)."
>> 
>> I notice that impala has a syntax for caching.
>> 
>> https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_hdfs_caching.html
>> 
>> Notice how the cache syntax did not make its way into Hive? It would make
>> sense if this feature trickled its way into hive and used HDFS caching for
>> example.
>> I have heard many people claim that using hive metastore is such a because
>> it is packaged weird (like with ORC), but again besides claim/complaining
>> no one has stepped up to deal with that.
>> 
>> What I would suggest is going forward for maybe a trial period of 6 months,
>> labeling JIRA tickets with a tag that would be
>> "SeeThisProvesWeNeedATLPMetastore". Because right now I do not see enough
>> active use cases of people giving anything back to justify hurting our
>> workflow so much.
>> 
>> 
>> "I partially agree with you.  On the reviews, JIRAs, etc. I don’t think it
>> adds much, if any, overhead.  Hive is a big project and no one person knows
>> all the code anymore.  If you wanted to add a permanent macros feature you
>> would need reviews from someone who knows the parser (probably Pengcheng),
>> people who know the optimizer (Jesus, Ashutosh, …), and someone who knows
>> the metastore (me, Thejas, …).  And any large feature is going to be
>> implemented over multiple JIRAs, all of which are linkable regardless of
>> whether the JIRAs start with METASTORE- or HIVE-.   I also don’t think it
>> makes the feature disagreement any worse.  If the optimizer team absolutely
>> insists it has to have some feature and the metastore team insists that it
>> can’t have that feature you’re going to have to work through the issue
>> whether they all are in Hive or in two separate projects"
>> 
>> Macro was done in 1 patch and reviewed by 2 people. With 2-3 follow on
>> bugs.
>> 
>> https://issues.apache.org/jira/browse/HIVE-2655
>> 
>> I think your perception is different than mine because of circumstances. I
>> have waited weeks/months for reviews/merges (in Hive and other apache
>> projects) from mundane udfs to cassandra-storage-handlers. You obviously
>> work in a large company and you can more easily align objectives, go to the
>> water cooler and say "hey bob you know it would be cool if you can release
>> x so I can do y". When you are not in that situation its like, "hey mailing
>> list, my patch was done for three months now and like I have had to rebase
>> it three times and like I notice like other stuff is getting committed."
>> 
>> If you look at it tactically, "create permanent macro xyz". I go over to
>> calcite and suggest some changes there, if this concept is not "game
>> changer" it is probably going to sit unreviewed. If it is "game changer"
>> exciting that is 72 hours for release voting. Next go to hive-metastore
>> repeat the process, but remember now I have to "wow" the metastore people
>> with the "game changer" and if that crew is super focused on something
>> about kafka well now Hive features are second fiddle. Now let's say a hive
>> release is coming up, and I really want my feature in it.
>> hive-metastore-tlp might currently have a broken trunk because mongo wants
>> to add spaceships to wombats feature has a bug and frankly that should not
>> affect us.
>> 
>> I hate to draw in something else but I feel it is related:
>> 
>> 8 December 2016 : release 2.1.1 available
>> 07 April 2017 : release 1.2.2 available
>> hive-dev [DISCUSS] Supporting Hadoop-1 and experimental features
>> hive-dev Re: release chaos?
>> 
>> I have been vocal about not liking certain branching strategies and
>> proposals that take us away from releasable trunk. We have steadily headed
>> in a direction where we are pulling things out of hive, and we are not able
>> to turn out releases. We even had a thread "release chaos" talking about
>> our 5 active branches (with friends I say "jumped the shark"). Pulling out
>> the metastore is only going to make this worse. I do not even see the model
>> as successful. You may say it is great that calcite lets people share our
>> sql dialect or the ORC TLP has 5 committers, but if Hive can not get a
>> release out the door I do not see us optimizing for the proper thing.
>>
