I am sure there are other companies out there who are still on Hadoop 2.x
with migration to Hadoop 3.x being a no-go.
If Druid were to drop support for Hadoop 2.x completely, I am afraid it
would prevent those users from updating to newer versions of Druid, which
would be a shame.

FWIW, we have found in practice that for high-volume use cases, compaction
based on Druid's Hadoop-based batch ingestion is a lot more scalable than
the native compaction.

Having said that, as an alternative, if we can merge Julian's Spark-based
ingestion PR <https://github.com/apache/druid/issues/9780> into Druid, that
might provide an alternate way for users to get rid of the Hadoop
dependency.

On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <abhishek.agar...@imply.io>
wrote:

> Reviving this conversation again.
> @Will - Do you still have concerns about HDFS stability? Hadoop 3 has been
> around for some time now and is very stable as far as I know.
>
> The dependencies coming from Hadoop 2 are also old enough that they cause
> dependency scans to fail. E.g. Log4j 1.x dependencies that are coming from
> Hadoop 2, get flagged during these scans. We have also seen issues when
> customers try to use Hadoop ingestion with the latest log4j2 library.
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
>     at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
>     at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
>     at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
>
>
> Instead of fixing these point issues, we would be better served by moving
> to Hadoop 3 entirely. Hadoop 3 does get more frequent releases, and its
> dependencies are well isolated.
>
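As an illustration of the kind of point fix being discussed (a sketch only, with assumed artifact versions; check against the Hadoop release actually in use): the legacy log4j 1.x jar that hadoop-client 2.x drags in can typically be excluded in the downstream pom, with the log4j-1.2-api bridge supplied instead so that Hadoop's log4j 1.x API calls are routed to log4j2:

```xml
<!-- Exclude the log4j 1.x jar that clashes with the log4j2 bridge. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.10.2</version>
  <exclusions>
    <exclusion>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Bridge that reroutes log4j 1.x API calls to log4j2. -->
<dependency>
  <groupId>org.apache.logging.log4j</groupId>
  <artifactId>log4j-1.2-api</artifactId>
  <version>2.17.2</version>
</dependency>
```

This kind of per-dependency whack-a-mole is exactly the "point issue" fixing that moving to Hadoop 3 would avoid.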
> On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <karankumar1...@gmail.com>
> wrote:
>
> > Hello,
> > We could also use Maven profiles: keep hadoop2 support by default and
> > add a new Maven profile for hadoop3. This would allow users to choose
> > the profile best suited to their use case.
> > Agreed, it will not help with the Hadoop dependency problems, but it
> > does let our users run Druid with multiple flavors.
> > Also, with hadoop3, as Clint mentioned, the dependencies come
> > pre-shaded, so we significantly reduce our effort in solving the
> > dependency problems.
> > I have a PR in its last phases where I am able to run the entire test
> > suite (unit + integration tests) on both the default (hadoop2) profile
> > and the new hadoop3 profile.
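The profile setup described above might look roughly like this in the parent pom (a sketch only; the property name and version values are assumptions, not Druid's actual build configuration):

```xml
<profiles>
  <!-- Default: build against Hadoop 2. -->
  <profile>
    <id>hadoop2</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <hadoop.version>2.8.5</hadoop.version>
    </properties>
  </profile>
  <!-- Opt in with: mvn package -Phadoop3 -->
  <profile>
    <id>hadoop3</id>
    <properties>
      <hadoop.version>3.3.1</hadoop.version>
    </properties>
  </profile>
</profiles>
```

Dependency blocks would then reference `${hadoop.version}` so a single `-P` flag switches the whole build.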
> >
> >
> >
> > On 2021/06/09 11:55:31, Will Lauer <wla...@verizonmedia.com.INVALID>
> > wrote:
> > > Clint,
> > >
> > > I fully understand what type of headache dealing with these dependency
> > > issues is. We deal with this all the time, and based on conversations
> > > I've had with our internal hadoop development team, they are quite
> > > aware of them and just as frustrated by them as you are. I'm certainly
> > > in favor of doing something to improve this situation, as long as it
> > > doesn't abandon a large section of the user base, which I think
> > > DROPPING hadoop2 would do.
> > >
> > > I think there are solutions there that can help solve the conflicting
> > > dependency problem. Refactoring Hadoop support into an independent
> > > extension is certainly a start. But I think the dependency problem is
> > > bigger than that. There are always going to be conflicts between
> > > dependencies in the core system and in extensions as the system gets
> > > bigger. We have one right now internally that prevents us from
> > > enabling SQL in our instance of Druid, due to conflicts between the
> > > version of protobuf used by Calcite and one of our critical
> > > extensions. Long term, I think you are going to need to carefully
> > > think through a ClassLoader-based strategy to truly separate the
> > > impact of various dependencies.
> > >
> > > While I'm not seriously suggesting it for Druid, OSGi WOULD solve this
> > > problem. It's a system that allows you to explicitly declare what each
> > > bundle exposes to the system and what each bundle consumes from the
> > > system, allowing multiple conflicting dependencies to co-exist without
> > > impacting each other. OSGi is the big-hammer approach, but I bet a
> > > more appropriate solution would be a simpler custom-ClassLoader-based
> > > solution that hid all dependencies in extensions, keeping them from
> > > impacting the core, and that only exposed "public" pieces of the core
> > > to extensions. If Druid's core could be extended without impacting the
> > > various extensions, and the extensions' dependencies could be modified
> > > without impacting the core, this would go a long way towards solving
> > > the problem that you have described.
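The custom-ClassLoader idea can be sketched in a few lines. This is illustrative only, not Druid's actual implementation: a real version needs resource lookup, a carefully curated list of always-parent packages, and handling for the "public" core API classes mentioned above.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first (parent-last) class loader: an extension's own jars win over
// the core's copies of a dependency, so conflicting versions can coexist.
public class ChildFirstClassLoader extends URLClassLoader {
    public ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.") || name.startsWith("javax.")) {
                    // JDK classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // Try the extension's own jars first...
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // ...and fall back to the core's classpath.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

The hard part in practice is not this loop but deciding which packages (the "public" core pieces) must be pinned to the parent so that core and extension agree on shared interface types.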
> > >
> > > Will
> > >
> > > <http://www.verizonmedia.com>
> > >
> > > Will Lauer
> > >
> > > Senior Principal Architect, Audience & Advertising Reporting
> > > Data Platforms & Systems Engineering
> > >
> > > M 508 561 6427
> > > 1908 S. First St
> > > Champaign, IL 61822
> > >
> > > <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
> > > <https://www.linkedin.com/company/verizon-media/>
> > > <http://www.instagram.com/verizonmedia>
> > >
> > >
> > >
> > > On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <cwy...@apache.org> wrote:
> > >
> > > > @itai, I think pending the outcome of this discussion that it makes
> > > > sense to have a wider community thread to announce any decisions we
> > > > make here. Thanks for bringing that up.
> > > >
> > > > @rajiv, Minio support seems unrelated to this discussion. It seems
> > > > like a reasonable request, but I recommend starting another thread
> > > > to see if someone is interested in taking up this effort.
> > > >
> > > > @jihoon, I definitely agree that Hadoop should be refactored into an
> > > > extension longer term. I don't think this upgrade would necessarily
> > > > make doing such a refactor any easier, but not harder either. Just
> > > > moving Hadoop to an extension also unfortunately doesn't really do
> > > > anything to help our dependency problem, which is the thing that has
> > > > agitated me enough to start this thread and start looking into
> > > > solutions.
> > > >
> > > > @will/@frank, I feel like the stranglehold Hadoop has on our
> > > > dependencies has become especially painful in the last couple of
> > > > years. Most painful to me is that we are stuck using a version of
> > > > Apache Calcite from 2019 (six versions behind the latest), because
> > > > newer versions require a newer version of Guava. This means we
> > > > cannot get any bug fixes and improvements in our SQL parsing layer
> > > > without doing something like packaging a shaded version of it
> > > > ourselves or solving our Hadoop dependency problem.
> > > >
> > > > Many other dependencies have also proved problematic with Hadoop in
> > > > the past, and since we aren't able to run the Hadoop integration
> > > > tests in Travis, there is always the chance that we don't catch
> > > > these when they go in. Now that we have turned on dependabot this
> > > > week <https://github.com/apache/druid/pull/11079>, I imagine we are
> > > > going to have to proceed very carefully with it until we are able to
> > > > resolve this dependency issue.
> > > >
> > > > Hadoop 3.3.0 is also the first release to support running on a Java
> > > > version newer than Java 8, per
> > > > <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>,
> > > > which is another area we have been working towards: official Druid
> > > > support for Java 11+ environments.
> > > >
> > > > I'm sort of at a loss for what else to do besides one of:
> > > > - switching to these Hadoop 3 shaded jars and dropping 2.x support
> > > > - figuring out how to custom-package our own Hadoop 2.x
> > > > dependencies, shaded similarly to the Hadoop 3 client jars, and only
> > > > supporting Hadoop with application classpath isolation
> > > > (mapreduce.job.classloader = true)
> > > > - just dropping support for Hadoop completely
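For reference, the classpath-isolation setting mentioned above is applied per task via jobProperties in the Hadoop ingestion spec's tuningConfig. A sketch of the relevant fragment (the system-classes value is an assumption and would need tuning per deployment):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "jobProperties": {
      "mapreduce.job.classloader": "true",
      "mapreduce.job.classloader.system.classes": "java.,javax.,org.apache.hadoop."
    }
  }
}
```

With mapreduce.job.classloader = true, the MR framework loads job (Druid) classes in a separate class loader from the Hadoop system classes, which is what lets updated Druid dependencies coexist with the cluster's own jars.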
> > > >
> > > > I would much rather devote all effort into making Druid's native
> > > > batch ingestion better, to encourage people to migrate to it, than
> > > > continue fighting to figure out how to keep supporting Hadoop, so
> > > > upgrading and switching to the shaded client jars at least seemed
> > > > like a reasonable compromise short of dropping it completely. Maybe
> > > > making custom shaded Hadoop dependencies in the spirit of the Hadoop
> > > > 3 shaded jars isn't as hard as I am imagining, but it does seem like
> > > > the most work among the solutions I could think of to potentially
> > > > resolve this problem.
> > > >
> > > > Does anyone have any other ideas for how we can isolate our
> > > > dependencies from Hadoop? Solutions like shading Guava
> > > > <https://github.com/apache/druid/pull/10964> would let Druid itself
> > > > use newer Guava, but that doesn't help conflicts within our
> > > > dependencies, which has always seemed to be the larger problem to
> > > > me. Moving Hadoop support to an extension doesn't help anything
> > > > unless we can ensure, with some sort of classloader wizardry, that
> > > > we can run Druid ingestion tasks on Hadoop without having to match
> > > > all of the Hadoop cluster's dependencies.
> > > >
> > > > Maybe we could consider keeping a 0.22.x release line in Druid that
> > > > gets security and minor bug fixes for some period of time, to give
> > > > people longer to migrate off of Hadoop 2.x? I can't speak for the
> > > > rest of the committers, but I would personally be more open to
> > > > maintaining such a branch if it meant that, moving forward, we could
> > > > at least update all of our dependencies to newer versions, while
> > > > providing a transition path with at least some support until
> > > > migrating to Hadoop 3 or native Druid batch ingestion.
> > > >
> > > > Any other ideas?
> > > >
> > > >
> > > >
> > > > On Tue, Jun 8, 2021 at 7:44 PM frank chen <frankc...@apache.org>
> > wrote:
> > > >
> > > > > Considering Druid takes advantage of lots of external components
> > > > > to work, I think we should upgrade Druid in a somewhat
> > > > > conservative way. Dropping support for hadoop2 is not a good idea.
> > > > > The upgrade of the ZooKeeper client in Druid also prevents me from
> > > > > adopting 0.22 for a longer time.
> > > > >
> > > > > Although users could upgrade these dependencies first to use the
> > > > > latest Druid releases, frankly speaking, these upgrades are not so
> > > > > easy in production and usually take a longer time, which would
> > > > > prevent users from experiencing new features of Druid.
> > > > > For hadoop3, I have heard of some performance issues, which also
> > > > > leaves me with no confidence to upgrade.
> > > > >
> > > > > I think what Jihoon proposes is a good idea: separating hadoop2
> > > > > from the Druid core as an extension.
> > > > > Since hadoop2 has not reached EOL, to achieve a balance between
> > > > > compatibility and long-term evolution, maybe we could provide two
> > > > > extensions: one for hadoop2, one for hadoop3.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jun 9, 2021 at 4:13 AM Will Lauer
> > > > > <wla...@verizonmedia.com.invalid> wrote:
> > > > >
> > > > > > Just to follow up on this: our main problem with hadoop3 right
> > > > > > now has been instability in HDFS, to the extent that we put on
> > > > > > hold any plans to deploy it to our production systems. I would
> > > > > > claim hadoop3 isn't mature enough yet to consider migrating
> > > > > > Druid to it.
> > > > > >
> > > > > > Will
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 8, 2021 at 2:59 PM Will Lauer
> > > > > > <wla...@verizonmedia.com> wrote:
> > > > > >
> > > > > > > Unfortunately, the migration off of hadoop2 is a hard one
> > > > > > > (maybe not for Druid, but certainly for big organizations
> > > > > > > running large hadoop2 workloads). If Druid migrated to hadoop3
> > > > > > > after 0.22, that would probably prevent me from taking any new
> > > > > > > versions of Druid for at least the remainder of the year, and
> > > > > > > possibly longer.
> > > > > > >
> > > > > > > Will
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cwy...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> I've been assisting with some experiments to see how we might
> > > > > > >> want to migrate Druid to support Hadoop 3.x, and more
> > > > > > >> importantly, to see if maybe we can finally be free of some
> > > > > > >> of the dependency issues it has been causing for as long as I
> > > > > > >> can remember working with Druid.
> > > > > > >>
> > > > > > >> Hadoop 3 introduced shaded client jars
> > > > > > >> <https://issues.apache.org/jira/browse/HADOOP-11804>, with
> > > > > > >> the purpose of allowing applications to talk to the Hadoop
> > > > > > >> cluster without drowning in its transitive dependencies. The
> > > > > > >> experimental branch that I have been helping with, which uses
> > > > > > >> these new shaded client jars, can be seen in this PR
> > > > > > >> <https://github.com/apache/druid/pull/11314>, and is
> > > > > > >> currently working with the HDFS integration tests as well as
> > > > > > >> the Hadoop tutorial flow in the Druid docs (which is pretty
> > > > > > >> much equivalent to the HDFS integration test).
> > > > > > >>
> > > > > > >> The cloud deep storages still need some further testing, and
> > > > > > >> some minor cleanup still needs to be done for the docs and
> > > > > > >> such. Additionally, we still need to figure out how to handle
> > > > > > >> the Kerberos extension: because it extends some Hadoop
> > > > > > >> classes, it isn't able to use the shaded client jars in a
> > > > > > >> straightforward manner, so it still has heavy dependencies
> > > > > > >> and hasn't been tested. However, the experiment has started
> > > > > > >> to pan out enough that I think it is worth starting this
> > > > > > >> discussion, because it does have some implications.
> > > > > > >>
> > > > > > >> Making this change I think will allow us to update our
> > > > > > >> dependencies with a lot more freedom (I'm looking at you,
> > > > > > >> Guava), but the catch is that once we make this change and
> > > > > > >> start updating these dependencies, it will become hard,
> > > > > > >> nearing impossible, to support Hadoop 2.x, since as far as I
> > > > > > >> know there isn't an equivalent set of shaded client jars. I
> > > > > > >> am also not certain how far back the Hadoop job classpath
> > > > > > >> isolation stuff goes (mapreduce.job.classloader = true),
> > > > > > >> which I think is required to be set on Druid tasks for this
> > > > > > >> shaded stuff to work alongside updated Druid dependencies.
> > > > > > >>
> > > > > > >> Is anyone opposed to or worried about dropping Hadoop 2.x
> > > > > > >> support after the Druid 0.22 release?
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
>
