Hello,

I am not a committer, but I would like to add my opinion. At this stage of
development, I think it is quite risky to switch to a DI framework for a
couple of reasons.

1. A DI framework would have been a powerful tool if it had been
incorporated into the project from an early stage. Now, however, Hive has
way over 1 million lines of code and tens of thousands of test cases, and my
guess is that the overhead associated with introducing DI into Hive
(whether gradually or globally at once) is very likely to outweigh the
additional benefit, if any, of introducing DI, especially if we consider
the stability of its development infrastructure.

2. Implementing new features, such as DI, in Hive can be an exciting and
fun sub-project, but I think the more pressing issue is to stabilize the
current Hive code, although this is certainly less motivating and more
boring. I hope that no new major features, such as DI, will be introduced
until Hive becomes, say, as stable as Hive 3.1.

For 2, I can give a few examples to substantiate my claim.

1) Over the past few years, several new techniques for query compilation
have been introduced. Unfortunately they were buggy, and Hive started to
return wrong results, taking Hive 3.1.2 as the baseline for correct
behavior. (Yes, Hive 3.1.2 also has correctness bugs, but when tested
against TPC-DS, Hive 3.1.2 returned the same results as other frameworks,
so it can be used as a basis for comparison.) From our own testing, Hive
4.0.0-SNAPSHOT returns wrong results on several queries in TPC-DS, and this
should be a major setback for Hive. If interested, please see [1] and [2].

2) Perhaps for the same reason as in 1), Hive 4.0.0-SNAPSHOT is
noticeably slower than Hive 3.1.2 on the TPC-DS benchmark. However, this is
only from my own testing (using 10TB TPC-DS), and I hope that someone in
the Hive team will try similar experiments to confirm/refute my claim.

3) Currently many q tests are run against MapReduce (which is not
officially supported as far as I remember). However, some of these q tests
fail when run against Tez. If Tez and LLAP are the new execution engines,
these tests should be migrated as well.

Sungwoo Park

[1] https://issues.apache.org/jira/browse/HIVE-26654
[2] https://issues.apache.org/jira/browse/HIVE-27226

On Wed, Apr 12, 2023 at 10:12 PM Stamatis Zampetakis <zabe...@gmail.com>
wrote:

> Hey Laszlo,
>
> Dependency injection is a very powerful and useful tool/design pattern.
>
> I don't think there is a particular reason why Hive does not use a
> DI framework, apart maybe from the fact that we have lots of legacy
> code that existed before DI became that popular.
>
> I am open to ideas and suggestions about parts of the code that we
> could improve via DI. I would probably avoid big refactorings to core
> components of Hive for the sake of introducing a DI framework but I
> see no big issue using such frameworks in new code. As usual when we
> are about to introduce a new dependency to the project we should be
> mindful of all the implications that this might have.
>
> It's hard to make a generally applicable claim that we should use this
> or that framework, since I guess it has a lot to do with personal
> preferences; we tend to prefer things that we have already used. I
> haven't used DI frameworks that much, so I don't have a strong opinion
> on which framework is the best, and I am willing to follow the majority.
>
> Best,
> Stamatis
>
> On Tue, Apr 4, 2023 at 1:19 PM Laszlo Vegh <lv...@cloudera.com.invalid>
> wrote:
> >
> >
> > Hi all,
> >
> > I would like to start a conversation about introducing a Dependency
> Injection framework (like Spring, Guice, Weld, etc.) in Hive.
> >
> > IMHO the lack of such a framework makes the codebase way less organised
> and harder to maintain. Moreover, I think it has also led to the
> introduction of a huge number of static/utility methods and classes (which
> is highly discouraged when using DI frameworks). When there is no DI
> framework, utility classes with static methods often seem to be the
> simplest and best way to share code across different Hive
> components/classes, but these constructs are really killing testability.
> For example, it is much harder to mock static method calls than to mock
> service/component instances. Poor testability is a major issue on its own,
> but having a DI framework could bring many more benefits, like greater
> flexibility (modularity), better organised services, etc.
> >
> >
> > I’m interested in whether there’s any reason why there is no DI in Hive
> so far. I know there’s no way to introduce it everywhere in a single step,
> but we could start using it where it is easy to start, and continuously
> expand its usage from class to class. If there is no strong reason not to
> do it, I would like to start an open conversation around this topic.
> (Possible benefits, drawbacks, which framework to use, where to introduce
> it first, etc.)
> >
> > If anybody is interested in this initiative, please join the
> conversation, and add your thoughts, ideas, doubts, anything.
> >
> > Thanks,
> >
> > Laszlo Vegh
> > veghlac...@gmail.com <mailto:veghlac...@gmail.com>
>
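
The testability point raised in the quoted message (static utility calls
are harder to replace in a test than injected service instances) can be
illustrated with a minimal Java sketch. It uses plain constructor injection
and no DI framework at all. Every name in it (DiSketch, TimeUtils,
StaticExpiryChecker, InjectedExpiryChecker) is hypothetical and does not
refer to any existing Hive class.

```java
import java.util.function.LongSupplier;

// Hypothetical illustration only; none of these classes exist in Hive.
public class DiSketch {

    // Static-utility style: the time source is hard-wired into the caller.
    static final class TimeUtils {
        static long nowMillis() {
            return System.currentTimeMillis();
        }
    }

    static final class StaticExpiryChecker {
        boolean isExpired(long deadlineMillis) {
            // A test cannot swap this call out without static-mocking tools.
            return TimeUtils.nowMillis() > deadlineMillis;
        }
    }

    // Injected style: the time source is passed in through the constructor.
    static final class InjectedExpiryChecker {
        private final LongSupplier clock;

        InjectedExpiryChecker(LongSupplier clock) {
            this.clock = clock;
        }

        boolean isExpired(long deadlineMillis) {
            return clock.getAsLong() > deadlineMillis;
        }
    }

    public static void main(String[] args) {
        // Production wiring would pass the real clock:
        //   new InjectedExpiryChecker(System::currentTimeMillis)
        // A test passes a fixed clock instead, so the result is deterministic.
        InjectedExpiryChecker checker = new InjectedExpiryChecker(() -> 1_000L);
        System.out.println(checker.isExpired(999L));   // true: 1000 > 999
        System.out.println(checker.isExpired(1_000L)); // false: 1000 > 1000 fails
    }
}
```

A DI framework such as Guice or Spring would mainly automate the wiring
step (deciding which clock to pass to the constructor); the testability
gain itself comes from the dependency being a constructor argument.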
