Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

Abhishek Agarwal Tue, 26 Jul 2022 03:20:03 -0700

Reviving this conversation.
@Will - Do you still have concerns about HDFS stability? Hadoop 3 has been
around for some time now and is very stable as far as I know.


The dependencies coming from Hadoop 2 are also old enough that they cause
dependency scans to fail. For example, the Log4j 1.x dependencies pulled in by
Hadoop 2 get flagged during these scans. We have also seen issues when
customers try to use Hadoop ingestion with the latest log4j2 library:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
    at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
    at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
    at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)
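
For anyone chasing a NoSuchMethodError like this one, a rough sketch of a
classpath check may help. It prints which jar a class was actually loaded from
and which public overloads of a method it exposes. ClasspathProbe and its
helper names are hypothetical (they are not part of Druid, Hadoop or Log4j),
and the sketch assumes it is run on the same classpath as the failing job.
Only JDK APIs are used.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of Druid or Hadoop.
class ClasspathProbe {

    // Returns the jar or directory a class was loaded from, if the JVM knows it.
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    // Lists the signatures of all public methods (declared or inherited) with this name.
    static List<String> methodsNamed(Class<?> cls, String methodName) {
        List<String> signatures = new ArrayList<>();
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(methodName)) {
                signatures.add(m.toString());
            }
        }
        return signatures;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // The defaults are placeholders so the sketch runs without extra jars.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "length";
        Class<?> cls = Class.forName(className);
        System.out.println(className + " loaded from " + locationOf(cls));
        for (String signature : methodsNamed(cls, methodName)) {
            System.out.println("  " + signature);
        }
    }
}
```

For the trace above, one would pass org.apache.log4j.helpers.OptionConverter
and convertLevel as the two arguments. If the reported location is a Log4j 1.x
jar, or no overload matching the signature in the error is listed, that points
to an older copy of the class shadowing the one the caller was compiled
against.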


Instead of fixing these point issues one by one, we would be better served by
moving to Hadoop 3 entirely. Hadoop 3 gets more frequent releases, and its
dependencies are well isolated.

On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <[email protected]> wrote:

> Hello,
> We can also use Maven profiles. We keep hadoop2 support by default and add
> a new Maven profile for hadoop3. This allows users to choose the
> profile that is best suited to their use case.
> Agreed, it will not help with the Hadoop dependency problems, but it does
> enable our users to use Druid with multiple flavors of Hadoop.
> Also, with hadoop3, as Clint mentioned, the dependencies come pre-shaded, so
> we significantly reduce our effort in solving the dependency problems.
> I have the PR in its last phases, where I am able to run the entire test
> suite (unit + integration tests) on both the default profile, i.e. hadoop2,
> and the new hadoop3 profile.
>
>
>
> On 2021/06/09 11:55:31, Will Lauer <[email protected]>
> wrote:
> > Clint,
> >
> > I fully understand what type of headache dealing with these dependency
> > issues is. We deal with this all the time, and based on conversations I've
> > had with our internal hadoop development team, they are quite aware of them
> > and just as frustrated by them as you are. I'm certainly in favor of doing
> > something to improve this situation, as long as it doesn't abandon a large
> > section of the user base, which I think DROPPING hadoop2 would do.
> >
> > I think there are solutions there that can help solve the conflicting
> > dependency problem. Refactoring Hadoop support into an independent
> > extension is certainly a start. But I think the dependency problem is
> > bigger than that. There are always going to be conflicts between
> > dependencies in the core system and in extensions as the system gets
> > bigger. We have one right now internally that prevents us from enabling SQL
> > in our instance of Druid due to conflicts between versions of protobuf used
> > by Calcite vs one of our critical extensions. Long term, I think you are
> > going to need to carefully think through a ClassLoader based strategy to
> > truly separate the impact of various dependencies.
> >
> > While I'm not seriously suggesting it for Druid, OSGi WOULD solve this
> > problem. It's a system that allows you to explicitly declare what each
> > bundle exposes to the system, and what each bundle consumes from the
> > system, allowing multiple conflicting dependencies to co-exist without
> > impacting each other. OSGi is the big hammer approach, but I bet a more
> > appropriate solution would be a simpler custom-ClassLoader based solution
> > that hid all dependencies in extensions, keeping them from impacting the
> > core, and that only exposed "public" pieces of the core to extensions. If
> > Druid's core could be extended without impacting the various extensions,
> > and the extensions' dependencies could be modified without impacting the
> > core, this would go a long way towards solving the problem that you have
> > described.
> >
> > Will
> >
> > <http://www.verizonmedia.com>
> >
> > Will Lauer
> >
> > Senior Principal Architect, Audience & Advertising Reporting
> > Data Platforms & Systems Engineering
> >
> > M 508 561 6427
> > 1908 S. First St
> > Champaign, IL 61822
> >
> > <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
> > <https://www.linkedin.com/company/verizon-media/>
> > <http://www.instagram.com/verizonmedia>
> >
> >
> >
> > On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <[email protected]> wrote:
> >
> > > @itai, I think pending the outcome of this discussion that it makes sense
> > > to have a wider community thread to announce any decisions we make here,
> > > thanks for bringing that up.
> > >
> > > @rajiv, Minio support seems unrelated to this discussion. It seems like a
> > > reasonable request, but I recommend starting another thread to see if
> > > someone is interested in taking up this effort.
> > >
> > > @jihoon I definitely agree that Hadoop should be refactored to be an
> > > extension longer term. I don't think this upgrade would necessarily
> > > make doing such a refactor any easier, but not harder either. Just moving
> > > Hadoop to an extension also unfortunately doesn't really do anything to
> > > help our dependency problem though, which is the thing that has agitated me
> > > enough to start this thread and start looking into solutions.
> > >
> > > @will/@frank I feel like the stranglehold Hadoop has on our dependencies
> > > has become especially painful in the last couple of
> > > years. Most painful to me is that we are stuck using a version of Apache
> > > Calcite from 2019 (six versions behind the latest), because newer versions
> > > require a newer version of Guava. This means we cannot get any bug fixes
> > > and improvements in our SQL parsing layer without doing something like
> > > packaging a shaded version of it ourselves or solving our Hadoop dependency
> > > problem.
> > >
> > > Many other dependencies have also proved problematic with Hadoop in
> > > the past, and since we aren't able to run the Hadoop integration tests in
> > > Travis, there is always the chance that sometimes we don't catch these when
> > > they go in. I imagine now that we have turned on dependabot this week,
> > > https://github.com/apache/druid/pull/11079, that we are going to have to
> > > proceed very carefully with it until we are able to resolve this dependency
> > > issue.
> > >
> > > Hadoop 3.3.0 is also the first to support running on a Java version that is
> > > newer than Java 8 per
> > > https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
> > > which is another area we have been working towards - Druid to officially
> > > support Java 11+ environments.
> > >
> > > I'm sort of at a loss of what else to do besides one of
> > > - switching to these Hadoop 3 shaded jars and dropping 2.x support
> > > - figuring out how to custom package our own Hadoop 2.x dependencies
> > >   that are shaded similarly to the Hadoop 3 client jars, and only supporting
> > >   Hadoop with application classpath isolation
> > >   (mapreduce.job.classloader = true)
> > > - just dropping support for Hadoop completely
> > >
> > > I would much rather devote all effort into making Druid's native batch
> > > ingestion better to encourage people to migrate to that, than continuing to
> > > fight with figuring out how to keep supporting Hadoop, so upgrading and
> > > switching to the shaded client jars at least seemed like a reasonable
> > > compromise to dropping it completely. Maybe making custom shaded Hadoop
> > > dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I
> > > am imagining, but it does seem like the most work among the
> > > solutions I could think of to potentially resolve this problem.
> > >
> > > Does anyone have any other ideas of how we can isolate our dependencies
> > > from Hadoop? Solutions like shading Guava,
> > > https://github.com/apache/druid/pull/10964, would let Druid itself use
> > > newer Guava, but that doesn't help conflicts within our dependencies, which
> > > has always seemed to be the larger problem to me. Moving Hadoop support to
> > > an extension doesn't help anything unless we can ensure that we can run
> > > Druid ingestion tasks on Hadoop without having to match all of the Hadoop
> > > cluster's dependencies with some sort of classloader wizardry.
> > >
> > > Maybe we could consider keeping a 0.22.x release line in Druid that gets
> > > security and minor bug fixes for some period of time to give people a
> > > longer period to migrate off of Hadoop 2.x? I can't speak for the rest of
> > > the committers, but I would personally be more open to maintaining such a
> > > branch if it meant that moving forward at least we could update all of our
> > > dependencies to newer versions, while providing a transition path to still
> > > have at least some support until migrating to Hadoop 3 or native Druid
> > > batch ingestion.
> > >
> > > Any other ideas?
> > >
> > >
> > >
> > > On Tue, Jun 8, 2021 at 7:44 PM frank chen <[email protected]> wrote:
> > >
> > > > Considering Druid takes advantage of lots of external components to work, I
> > > > think we should upgrade Druid in a somewhat conservative way. Dropping
> > > > support for hadoop2 is not a good idea.
> > > > The upgrading of the ZooKeeper client in Druid also prevents me from
> > > > adopting 0.22 for a longer time.
> > > >
> > > > Although users could upgrade these dependencies first to use the latest
> > > > Druid releases, frankly speaking, these upgrades are not so easy in
> > > > production and usually take a long time, which would prevent users from
> > > > experiencing new features of Druid.
> > > > For hadoop3, I have heard of some performance issues, which also makes me
> > > > have no confidence to upgrade.
> > > >
> > > > I think what Jihoon proposes is a good idea, separating hadoop2 from Druid
> > > > core as an extension.
> > > > Since hadoop2 has not reached EOL, to achieve balance between compatibility
> > > > and long term evolution, maybe we could provide two extensions, one for
> > > > hadoop2, one for hadoop3.
> > > >
> > > >
> > > >
> > > > On Wed, Jun 9, 2021 at 4:13 AM Will Lauer <[email protected]> wrote:
> > > >
> > > > > Just to follow up on this, our main problem with hadoop3 right now has been
> > > > > instability in HDFS, to the extent that we put on hold any plans to deploy
> > > > > it to our production systems. I would claim Hadoop3 isn't mature enough yet
> > > > > to consider migrating Druid to it.
> > > > >
> > > > > Will
> > > > >
> > > > > On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <[email protected]> wrote:
> > > > >
> > > > > > Unfortunately, the migration off of hadoop2 is a hard one (maybe not for
> > > > > > Druid, but certainly for big organizations running large hadoop2
> > > > > > workloads). If Druid migrated to hadoop3 after 0.22, that would probably
> > > > > > prevent me from taking any new versions of Druid for at least the remainder
> > > > > > of the year and possibly longer.
> > > > > >
> > > > > > Will
> > > > > >
> > > > > > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <[email protected]> wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> I've been assisting with some experiments to see how we might want to
> > > > > >> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
> > > > > >> can finally be free of some of the dependency issues it has been causing
> > > > > >> for as long as I can remember working with Druid.
> > > > > >>
> > > > > >> Hadoop 3 introduced shaded client jars,
> > > > > >> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose to
> > > > > >> allow applications to talk to the Hadoop cluster without drowning in its
> > > > > >> transitive dependencies. The experimental branch that I have been helping
> > > > > >> with, which is using these new shaded client jars, can be seen in this PR,
> > > > > >> https://github.com/apache/druid/pull/11314, and is currently working with
> > > > > >> the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
> > > > > >> docs (which is pretty much equivalent to the HDFS integration test).
> > > > > >>
> > > > > >> The cloud deep storages still need some further testing and some minor
> > > > > >> cleanup still needs to be done for the docs and such. Additionally we still
> > > > > >> need to figure out how to handle the Kerberos extension, because it extends
> > > > > >> some Hadoop classes so isn't able to use the shaded client jars in a
> > > > > >> straightforward manner, and so still has heavy dependencies and hasn't
> > > > > >> been tested. However, the experiment has started to pan out enough to where
> > > > > >> I think it is worth starting this discussion, because it does have some
> > > > > >> implications.
> > > > > >>
> > > > > >> Making this change I think will allow us to update our dependencies with a
> > > > > >> lot more freedom (I'm looking at you, Guava), but the catch is that once we
> > > > > >> make this change and start updating these dependencies, it will become
> > > > > >> hard, nearing impossible to support Hadoop 2.x, since as far as I know
> > > > > >> there isn't an equivalent set of shaded client jars. I am also not certain
> > > > > >> how far back the Hadoop job classpath isolation stuff goes
> > > > > >> (mapreduce.job.classloader = true) which I think is required to be set on
> > > > > >> Druid tasks for this shaded stuff to work alongside updated Druid
> > > > > >> dependencies.
> > > > > >>
> > > > > >> Is anyone opposed to or worried about dropping Hadoop 2.x support after the
> > > > > >> Druid 0.22 release?
> > > > > >>
