Re: Introducing a DI framework in Hive?

2023-04-12 Thread Attila Turoczy
Hi Stamatis and Sungwoo,

Agree with several points. Hive has millions of lines of code that are here
and will stay with us as they are; that is not in question. But we need to
think about the future of the project. No engineer in the world wants to
use old, legacy technologies; every engineer wants to use cool stuff where
they can learn new things, patterns, and designs. If we do not improve our
codebase, it will become a legacy zombieland that won't be touched by love
and passion. *(Oh what a management bullshit - you can tell :) )* But I
truly think that introducing new principles could give us speed,
motivation, and power to continue the innovation. As an engineer I always
want to use a modern approach, because it gives me more excitement. I think
that introducing DI into this type of project is hard and challenging, and
that is exciting. I want to live in a world where Hive is a leader in
adopting new principles, stable and easy to use, and where the onboarding
experience is much faster and easier.

I don't wanna live in a world 

As you wrote, DI is powerful, and Hive does not use it because it became
widely used only after Hive had started. If we / you introduce it, it does
not mean we have to refactor every module to use DI. But we can try to
identify some components where we would introduce it, and we could also
write docs for others on how to use and implement it. Maybe just 1-2
components at first; others will come later as we touch them, if it makes
sense. We won't remove every static utils class, because that would not
make sense, but with baby steps we could try to introduce it, and for new
development we could adopt a loosely coupled standard, where every
dependency is more lightweight and the components are easier to test.
(Which -could- improve the quality as well.)
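
To make the baby-steps idea concrete, here is a minimal sketch in plain
Java with no DI framework at all. Every name in it (LegacyTimeUtils,
TimeSource, LegacyTimeAdapter, LockExpiryChecker) is hypothetical and not
taken from the Hive codebase; it only assumes a typical static utility
class of the kind discussed in this thread. The static class stays
untouched, a small adapter exposes it as an instance, and new code takes
its dependency through the constructor so a test can swap in a fake.

```java
// Hypothetical legacy static utility; stands in for an existing static helper.
final class LegacyTimeUtils {
    private LegacyTimeUtils() {}

    static long nowMillis() {
        return System.currentTimeMillis();
    }
}

// Small interface that new code depends on instead of the static call.
interface TimeSource {
    long nowMillis();
}

// Adapter: leaves the static utility as it is and exposes it as an instance.
final class LegacyTimeAdapter implements TimeSource {
    @Override
    public long nowMillis() {
        return LegacyTimeUtils.nowMillis();
    }
}

// Hypothetical new component: its dependency arrives through the constructor.
final class LockExpiryChecker {
    private final TimeSource timeSource;
    private final long timeoutMillis;

    LockExpiryChecker(TimeSource timeSource, long timeoutMillis) {
        this.timeSource = timeSource;
        this.timeoutMillis = timeoutMillis;
    }

    // True when more than timeoutMillis has passed since the lock was acquired.
    boolean isExpired(long acquiredAtMillis) {
        return timeSource.nowMillis() - acquiredAtMillis > timeoutMillis;
    }
}

final class DiSketch {
    public static void main(String[] args) {
        // Production-style wiring: the real clock behind the adapter.
        LockExpiryChecker real = new LockExpiryChecker(new LegacyTimeAdapter(), 5_000L);
        System.out.println("real clock, fresh lock expired? "
                + real.isExpired(System.currentTimeMillis()));

        // Test-style wiring: a fixed fake clock, no static mocking needed.
        TimeSource fixed = () -> 10_000L;
        LockExpiryChecker underTest = new LockExpiryChecker(fixed, 5_000L);
        System.out.println("acquired at 1000 expired? " + underTest.isExpired(1_000L));
        System.out.println("acquired at 6000 expired? " + underTest.isExpired(6_000L));
    }
}
```

A DI framework would mostly replace the hand-written wiring in main; the
constructor-based shape of LockExpiryChecker would stay the same whichever
framework (if any) gets picked.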


#2 The quality of 3.1.x vs 4.0.x is a somewhat different topic. I don't
think it has much connection to DI, and I think we should talk about the
root causes in a different thread. You had several good points. We - ALL of
us - should be more careful about this type of issue. It was the same in
the past; especially when Hive 3 was introduced there were several similar
issues. This can happen when groundbreaking changes come to the repository.
Also, I think the 4.0.0 alpha label describes it as something that is not
set in stone. But anyhow, you are right, we have to be more careful! But
let's start a different thread about it.


-Attila

On Wed, Apr 12, 2023 at 5:07 PM Sungwoo Park  wrote:

> Hello,
>
> I am not a committer, but I would like to add my opinion. At this stage of
> development, I think it is quite risky to switch to a DI framework for a
> couple of reasons.
>
> 1. A DI framework would have been a powerful tool if it had been
> incorporated into the project from the early stage. Now, however, Hive has
> way over 1 million lines of code and tens of thousands of test cases, and my
> guess is that the overhead associated with introducing DI into Hive
> (whether gradually or globally at once) is very likely to outweigh the
> additional benefit, if any, of introducing DI, especially if we consider
> the stability of its development infrastructure.
>
> 2. Implementing new features, such as DI, in Hive can be an exciting
> sub-project and fun, but I think more pressing issues are to stabilize the
> current Hive code, although this is certainly less motivating and more
> boring. I hope that no new major features, such as DI, will be introduced
> until Hive becomes, say, as stable as Hive 3.1.
>
> For 2, I can give a few examples to substantiate my claim.
>
> 1) For the past few years, several new techniques for query compilation
> have been introduced. Unfortunately they were buggy and Hive started to
> return wrong results, on the assumption that Hive 3.1.2 was working
> correctly. (Yes, Hive 3.1.2 also has correctness bugs, but when tested
> against TPC-DS, Hive 3.1.2 returned the same results as other frameworks,
> so it can be used as a basis for comparison.) From our own testing, Hive
> 4.0.0-SNAPSHOT returns wrong results on several queries in TPC-DS, and this
> should be a major setback for Hive. If interested, please see [1] and [2].
>
> 2) Perhaps due to the same reason as in 1), Hive 4.0.0-SNAPSHOT is
> noticeably slower than Hive 3.1.2 on the TPC-DS benchmark. However, this is
> only from my own testing (using 10TB TPC-DS), and I hope that someone in
> the Hive team will try similar experiments to confirm/refute my claim.
>
> 3) Currently many q tests are run against MapReduce (which is not
> officially supported as far as I remember). However, some of these q tests
> fail when run against Tez. If Tez and LLAP are the new execution engines,
> these tests should be migrated as well.
>
> Sungwoo Park
>
> [1] https://issues.apache.org/jira/browse/HIVE-26654
> [2] https://issues.apache.org/jira/browse/HIVE-27226
>
> On Wed, Apr 12, 2023 at 10:12 PM Stama

Re: Introducing a DI framework in Hive?

2023-04-12 Thread Sungwoo Park
Hello,

I am not a committer, but I would like to add my opinion. At this stage of
development, I think it is quite risky to switch to a DI framework for a
couple of reasons.

1. A DI framework would have been a powerful tool if it had been
incorporated into the project from the early stage. Now, however, Hive has
way over 1 million lines of code and tens of thousands of test cases, and my
guess is that the overhead associated with introducing DI into Hive
(whether gradually or globally at once) is very likely to outweigh the
additional benefit, if any, of introducing DI, especially if we consider
the stability of its development infrastructure.

2. Implementing new features, such as DI, in Hive can be an exciting
sub-project and fun, but I think more pressing issues are to stabilize the
current Hive code, although this is certainly less motivating and more
boring. I hope that no new major features, such as DI, will be introduced
until Hive becomes, say, as stable as Hive 3.1.

For 2, I can give a few examples to substantiate my claim.

1) For the past few years, several new techniques for query compilation
have been introduced. Unfortunately they were buggy and Hive started to
return wrong results, on the assumption that Hive 3.1.2 was working
correctly. (Yes, Hive 3.1.2 also has correctness bugs, but when tested
against TPC-DS, Hive 3.1.2 returned the same results as other frameworks,
so it can be used as a basis for comparison.) From our own testing, Hive
4.0.0-SNAPSHOT returns wrong results on several queries in TPC-DS, and this
should be a major setback for Hive. If interested, please see [1] and [2].

2) Perhaps due to the same reason as in 1), Hive 4.0.0-SNAPSHOT is
noticeably slower than Hive 3.1.2 on the TPC-DS benchmark. However, this is
only from my own testing (using 10TB TPC-DS), and I hope that someone in
the Hive team will try similar experiments to confirm/refute my claim.

3) Currently many q tests are run against MapReduce (which is not
officially supported as far as I remember). However, some of these q tests
fail when run against Tez. If Tez and LLAP are the new execution engines,
these tests should be migrated as well.

Sungwoo Park

[1] https://issues.apache.org/jira/browse/HIVE-26654
[2] https://issues.apache.org/jira/browse/HIVE-27226

On Wed, Apr 12, 2023 at 10:12 PM Stamatis Zampetakis 
wrote:

> Hey Laszlo,
>
> Dependency injection is a very powerful and useful tool/design pattern.
>
> I don't think there is a particular reason for which Hive does not use
> a DI framework, apart maybe from the fact that we have lots of legacy
> code that existed before DI became that popular.
>
> I am open to ideas and suggestions about parts of the code that we
> could improve via DI. I would probably avoid big refactorings to core
> components of Hive for the sake of introducing a DI framework but I
> see no big issue using such frameworks in new code. As usual when we
> are about to introduce a new dependency to the project we should be
> mindful of all the implications that this might have.
>
> It's hard to make a generally applicable claim that we should use this
> or that framework since I guess it has to do a lot with personal
> preferences; we tend to prefer things that we have already used. I
> haven't used DI frameworks that much so don't have a strong opinion on
> which framework is the best so I am willing to follow the majority.
>
> Best,
> Stamatis
>
> On Tue, Apr 4, 2023 at 1:19 PM Laszlo Vegh 
> wrote:
> >
> >
> > Hi all,
> >
> > I would like to start a conversation about introducing some Dependency
> > Injection framework (like Spring, Guice, Weld, etc.) in Hive.
> >
> > IMHO the lack of such a framework makes the codebase way less organised
> > and harder to maintain. Moreover, I think it also led to introducing a
> > huge amount of static/utility methods and classes (which is highly
> > discouraged when using DI frameworks). When there is no DI framework,
> > utility classes with static methods often seem to be the simplest and
> > best way to share code across different Hive components/classes, but
> > these constructs are really killing testability. For example, it is much
> > harder to mock static method calls than to mock service/component
> > instances. Poor testability is a major issue on its own, but having a DI
> > framework could have many more benefits, like greater flexibility
> > (modularity), better organised services, etc.
> >
> >
> > I’m interested if there’s any reason why there is no DI in Hive so far.
> > I know there’s no way to introduce it everywhere in a single step, but
> > we could start using it where it is easy to start, and continuously
> > expand its usage from class to class. If there is no strong reason why
> > not to do it, I would like to start an open conversation around this
> > topic. (Possible benefits, drawbacks, which framework to use, where to
> > introduce it first, etc.)
> >
> > If anybody is interested in this initiative, please join the
> > conversation, and add your thoughts, i

Re: [DISCUSS] Move Jira notification emails out of dev@hive

2023-04-12 Thread Stamatis Zampetakis
INFRA-24440 is resolved so all JIRA traffic now goes to issues@hive.
Don't forget to subscribe to that list if you wish to follow the
creation of new tickets etc.

Best,
Stamatis

On Fri, Apr 7, 2023 at 9:55 AM Stamatis Zampetakis  wrote:
>
> Just logged https://issues.apache.org/jira/browse/INFRA-24440 to move
> this forward.
>
> Best,
> Stamatis
>
> On Thu, Mar 30, 2023 at 11:12 AM Stamatis Zampetakis  
> wrote:
> >
> > I will proceed with the changes needed to move the Jira traffic out of the 
> > dev list sometime next week.
> >
> > If there are reasons to delay or abandon the proposal please let me know.
> >
> > Best,
> > Stamatis
> >
> > On Mon, Mar 27, 2023, 5:39 AM Sungwoo Park  wrote:
> >>
> >> I like the proposal very much. (Then, hopefully this mailing list will
> >> be useful to outside contributors as well.)
> >>
> >> --- Sungwoo Park
> >>
> >> On Sat, 25 Mar 2023, Stamatis Zampetakis wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > In the last Hive board report someone mentioned that the volume of Jira
> >> > notification emails to the dev list is huge, especially when compared to
> >> > emails sent by actual humans, making it hard for someone to follow what's
> >> > happening in the project.
> >> >
> >> > I personally share their viewpoint. For a long time I have been relying
> >> > on client side (Gmail) filters to separate Jira notifications from other
> >> > emails to the dev list.
> >> >
> >> > I think it would be better to direct the traffic from jira to a separate
> >> > list namely jira@hive to keep the dev@hive list clean and dedicated to
> >> > human interaction.
> >> >
> >> > What do you think?
> >> >
> >> > Best,
> >> > Stamatis
> >> >


Re: Introducing a DI framework in Hive?

2023-04-12 Thread Stamatis Zampetakis
Hey Laszlo,

Dependency injection is a very powerful and useful tool/design pattern.

I don't think there is a particular reason for which Hive does not use
a DI framework, apart maybe from the fact that we have lots of legacy
code that existed before DI became that popular.

I am open to ideas and suggestions about parts of the code that we
could improve via DI. I would probably avoid big refactorings to core
components of Hive for the sake of introducing a DI framework but I
see no big issue using such frameworks in new code. As usual when we
are about to introduce a new dependency to the project we should be
mindful of all the implications that this might have.

It's hard to make a generally applicable claim that we should use this
or that framework since I guess it has to do a lot with personal
preferences; we tend to prefer things that we have already used. I
haven't used DI frameworks that much so don't have a strong opinion on
which framework is the best so I am willing to follow the majority.

Best,
Stamatis

On Tue, Apr 4, 2023 at 1:19 PM Laszlo Vegh  wrote:
>
>
> Hi all,
>
> I would like to start a conversation about introducing some Dependency
> Injection framework (like Spring, Guice, Weld, etc.) in Hive.
>
> IMHO the lack of such a framework makes the codebase way less organised and
> harder to maintain. Moreover, I think it also led to introducing a huge
> amount of static/utility methods and classes (which is highly discouraged
> when using DI frameworks). When there is no DI framework, utility classes
> with static methods often seem to be the simplest and best way to share code
> across different Hive components/classes, but these constructs are really
> killing testability. For example, it is much harder to mock static method
> calls than to mock service/component instances. Poor testability is a major
> issue on its own, but having a DI framework could have many more benefits,
> like greater flexibility (modularity), better organised services, etc.
>
>
> I’m interested if there’s any reason why there is no DI in Hive so far. I
> know there’s no way to introduce it everywhere in a single step, but we could
> start using it where it is easy to start, and continuously expand its usage
> from class to class. If there is no strong reason why not to do it, I would
> like to start an open conversation around this topic. (Possible benefits,
> drawbacks, which framework to use, where to introduce it first, etc.)
>
> If anybody is interested in this initiative, please join the conversation,
> and add your thoughts, ideas, doubts, anything.
>
> Thanks,
>
> Laszlo Vegh
> veghlac...@gmail.com