Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

Karan Kumar Mon, 11 Oct 2021 23:35:17 -0700

Hello,

We could also use Maven profiles: keep Hadoop 2 support as the default and add
a new Maven profile for Hadoop 3. Users can then choose the profile that best
suits their use case.
Agreed, this will not help with the Hadoop dependency problems, but it does
let our users run Druid against multiple Hadoop flavors.
Also, with Hadoop 3, as Clint mentioned, the dependencies come pre-shaded, so
we significantly reduce the effort of solving the dependency problems.
I have a PR in its final stages in which I am able to run the entire test
suite (unit + integration tests) on both the default (i.e. Hadoop 2) profile
and the new Hadoop 3 profile.




On 2021/06/09 11:55:31, Will Lauer <[email protected]> wrote: 
> Clint,
> 
> I fully understand what type of headache dealing with these dependency
> issues is. We deal with this all the time, and based on conversations I've
> had with our internal hadoop development team, they are quite aware of them
> and just as frustrated by them as you are. I'm certainly in favor of doing
> something to improve this situation, as long as it doesn't abandon a large
> section of the user base, which I think DROPPING hadoop2 would do.
> 
> I think there are solutions there that can help solve the conflicting
> dependency problem. Refactoring Hadoop support into an independent
> extension is certainly a start. But I think the dependency problem is
> bigger than that. There are always going to be conflicts between
> dependencies in the core system and in extensions as the system gets
> bigger. We have one right now internally that prevents us from enabling SQL
> in our instance of Druid due to conflicts between versions of protobuf used
> by Calcite vs one of our critical extensions. Long term, I think you are
> going to need to carefully think through a ClassLoader based strategy to
> truly separate the impact of various dependencies.
> 
> While I'm not seriously suggesting it for Druid, OSGi WOULD solve this
> problem. It's a system that allows you to explicitly declare what each
> bundle exposes to the system, and what each bundle consumes from the
> system, allowing multiple conflicting dependencies to co-exist without
> impacting each other. OSGi is the big hammer approach, but I bet a more
> appropriate solution would be a simpler custom-ClassLoader based solution
> that hid all dependencies in extensions, keeping them from impacting the
> core, and that only exposed "public" pieces of the core to extensions. If
> Druid's core could be extended without impacting the various extensions,
> and the extensions' dependencies could be modified without impacting the
> core, this would go a long way towards solving the problem that you have
> described.
> 
> Will
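
As a rough illustration of the custom-ClassLoader idea Will describes above,
here is a minimal Java sketch. It is not Druid's extension loader and not a
proposal for one; the class name ExtensionClassLoader, the isShared and
canLoad helpers, and the package prefixes mentioned afterwards are all
hypothetical. Classes under a "shared" prefix are delegated to the core
loader; everything else is resolved only from the extension's own jars, so the
core's other dependencies stay hidden from the extension.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical per-extension class loader (illustration only).
 * Shared prefixes are served by the parent (core) loader; all other
 * classes must come from the extension's own jars.
 */
class ExtensionClassLoader extends URLClassLoader {
    private final List<String> sharedPrefixes;

    ExtensionClassLoader(URL[] extensionJars, ClassLoader coreLoader, List<String> sharedPrefixes) {
        super(extensionJars, coreLoader);
        this.sharedPrefixes = new ArrayList<>(sharedPrefixes);
    }

    /** True if the class belongs to a package the core exposes to extensions. */
    boolean isShared(String className) {
        for (String prefix : sharedPrefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                if (isShared(name)) {
                    // Exposed API and JDK classes: normal parent-first delegation.
                    loaded = super.loadClass(name, false);
                } else {
                    // Everything else: extension jars only, no fallback to the core.
                    loaded = findClass(name);
                }
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }

    /** Convenience helper: reports whether this loader can see the class. */
    boolean canLoad(String className) {
        try {
            loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

A caller would pass shared prefixes such as "java." plus whichever packages
the core chooses to expose (for example a hypothetical
"org.example.core.api."). A real implementation would also have to deal with
resources, service loaders and the thread context ClassLoader, all of which
this sketch ignores.
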
> 
> <http://www.verizonmedia.com>
> 
> Will Lauer
> 
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
> 
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
> 
> <http://www.facebook.com/verizonmedia>   <http://twitter.com/verizonmedia>
> <https://www.linkedin.com/company/verizon-media/>
> <http://www.instagram.com/verizonmedia>
> 
> 
> 
> On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <[email protected]> wrote:
> 
> > @itai, I think pending the outcome of this discussion that it makes sense
> > to have a wider community thread to announce any decisions we make here,
> > thanks for bringing that up.
> >
> > @rajiv, Minio support seems unrelated to this discussion. It seems like a
> > reasonable request, but I recommend starting another thread to see if
> > someone is interested in taking up this effort.
> >
> > @jihoon I definitely agree that Hadoop should be refactored to be an
> > extension longer term. I don't think this upgrade would necessarily
> > make doing such a refactor any easier, but not harder either. Just moving
> > Hadoop to an extension also unfortunately doesn't really do anything to
> > help our dependency problem though, which is the thing that has agitated me
> > enough to start this thread and start looking into solutions.
> >
> > @will/@frank I feel like the stranglehold Hadoop has on our dependencies
> > has become especially painful in the last couple of
> > years. Most painful to me is that we are stuck using a version of Apache
> > Calcite from 2019 (six versions behind the latest), because newer versions
> > require a newer version of Guava. This means we cannot get any bug fixes
> > and improvements in our SQL parsing layer without doing something like
> > packaging a shaded version of it ourselves or solving our Hadoop dependency
> > problem.
> >
> > Many other dependencies have also proved problematic with Hadoop in the
> > past, and since we aren't able to run the Hadoop integration tests in
> > Travis, there is always the chance that we don't catch these when they go
> > in. I imagine now that we have turned on dependabot this week,
> > <https://github.com/apache/druid/pull/11079>, that we are going to have to
> > proceed very carefully with it until we are able to resolve this dependency
> > issue.
> >
> > Hadoop 3.3.0 is also the first to support running on a Java version that is
> > newer than Java 8 per
> > <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>,
> > which is another area we have been working towards - Druid to officially
> > support Java 11+ environments.
> >
> > I'm sort of at a loss of what else to do besides one of
> > - switching to these Hadoop 3 shaded jars and dropping 2.x support
> > - figuring out how to custom package our own Hadoop 2.x dependencies
> > that are shaded similarly to the Hadoop 3 client jars, and only supporting
> > Hadoop with application classpath isolation (mapreduce.job.classloader =
> > true)
> > - just dropping support for Hadoop completely
> >
> > I would much rather devote all effort into making Druid's native batch
> > ingestion better to encourage people to migrate to that, than continuing to
> > fight with figuring out how to keep supporting Hadoop, so upgrading and
> > switching to the shaded client jars at least seemed like a reasonable
> > compromise to dropping it completely. Maybe making custom shaded Hadoop
> > dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I
> > am imagining, but it does seem like the most amount of work between the
> > solutions I could think of to potentially resolve this problem.
> >
> > Does anyone have any other ideas of how we can isolate our dependencies
> > from Hadoop? Solutions like shading Guava,
> > <https://github.com/apache/druid/pull/10964>, would let Druid itself use
> > newer Guava, but that doesn't help conflicts within our dependencies, which
> > has always seemed to be the larger problem to me. Moving Hadoop support to
> > an extension doesn't help anything unless we can ensure that we can run
> > Druid ingestion tasks on Hadoop without having to match all of the Hadoop
> > cluster's dependencies with some sort of classloader wizardry.
> >
> > Maybe we could consider keeping a 0.22.x release line in Druid that gets
> > security and minor bug fixes for some period of time to give people a
> > longer period to migrate off of Hadoop 2.x? I can't speak for the rest of
> > the committers, but I would personally be more open to maintaining such a
> > branch if it meant that moving forward at least we could update all of our
> > dependencies to newer versions, while providing a transition path to still
> > have at least some support until migrating to Hadoop 3 or native Druid
> > batch ingestion.
> >
> > Any other ideas?
> >
> >
> >
> > On Tue, Jun 8, 2021 at 7:44 PM frank chen <[email protected]> wrote:
> >
> > > Considering Druid takes advantage of lots of external components to work, I
> > > think we should upgrade Druid in a somewhat conservative way. Dropping
> > > support for hadoop2 is not a good idea.
> > > The upgrade of the ZooKeeper client in Druid also prevents me from
> > > adopting 0.22 for a longer time.
> > >
> > > Although users could upgrade these dependencies first to use the latest
> > > Druid releases, frankly speaking, these upgrades are not so easy in
> > > production and usually take a long time, which would prevent users from
> > > experiencing new features of Druid.
> > > For hadoop3, I have heard of some performance issues, which also leaves me
> > > with no confidence to upgrade.
> > >
> > > I think what Jihoon proposes is a good idea: separating hadoop2 from Druid
> > > core as an extension.
> > > Since hadoop2 has not reached EOL, to strike a balance between compatibility
> > > and long-term evolution, maybe we could provide two extensions, one for
> > > hadoop2 and one for hadoop3.
> > >
> > >
> > >
> > > On Wed, Jun 9, 2021 at 4:13 AM Will Lauer <[email protected]> wrote:
> > >
> > > > Just to follow up on this, our main problem with hadoop3 right now has been
> > > > instability in HDFS, to the extent that we put on hold any plans to deploy
> > > > it to our production systems. I would claim Hadoop3 isn't mature enough yet
> > > > to consider migrating Druid to it.
> > > >
> > > > Will
> > > >
> > > > On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <[email protected]> wrote:
> > > >
> > > > > Unfortunately, the migration off of hadoop2 is a hard one (maybe not for
> > > > > Druid, but certainly for big organizations running large hadoop2
> > > > > workloads). If Druid migrated to hadoop3 after 0.22, that would probably
> > > > > prevent me from taking any new versions of Druid for at least the remainder
> > > > > of the year and possibly longer.
> > > > >
> > > > > Will
> > > > >
> > > > >
> > > > > <http://www.verizonmedia.com>
> > > > >
> > > > > Will Lauer
> > > > >
> > > > > Senior Principal Architect, Audience & Advertising Reporting
> > > > > Data Platforms & Systems Engineering
> > > > >
> > > > > M 508 561 6427
> > > > > 1908 S. First St
> > > > > Champaign, IL 61822
> > > > >
> > > > > <
> > https://urldefense.proofpoint.com/v2/url?u=http-3A__www.facebook.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=FZ4dYSh4h5dDUO8gMu1WnMJYULsDN4hZPNJUqDythiU&e=
> > >   <
> > > >
> > https://urldefense.proofpoint.com/v2/url?u=http-3A__twitter.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=W_tqzh_jnVhXD_NXIsB8s-f7F_ZO1QCYPv3U1OyNJfs&e=
> > >
> > > > >    <
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_verizon-2Dmedia_&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=U6DtsEa4Fr2uBu39uaxBIK_th685qDrjPaO3kXZZ0d8&e=
> > >
> > > > > <
> > https://urldefense.proofpoint.com/v2/url?u=http-3A__www.instagram.com_verizonmedia&d=DwIFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=ULseRJUsY5gTBgFA9-BUxg&m=6ZP1rygSgHS9fZ6sNwI10fe7Zr9_IIAxDoe_TVLHPjc&s=gneN2k-ykLUBzoWtYZNsSZ9Bxki7XEvx2tliibfAXys&e=
> > >
> > > > >
> > > > >
> > > > >
> > > > > > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <[email protected]> wrote:
> > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> I've been assisting with some experiments to see how we might want to
> > > > > >> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
> > > > > >> can finally be free of some of the dependency issues it has been causing
> > > > > >> for as long as I can remember working with Druid.
> > > > >>
> > > > > >> Hadoop 3 introduced shaded client jars,
> > > > > >> <https://issues.apache.org/jira/browse/HADOOP-11804>, with the purpose to
> > > > > >> allow applications to talk to the Hadoop cluster without drowning in its
> > > > > >> transitive dependencies. The experimental branch that I have been helping
> > > > > >> with, which is using these new shaded client jars, can be seen in this PR,
> > > > > >> <https://github.com/apache/druid/pull/11314>, and is currently working with
> > > > > >> the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
> > > > > >> docs (which is pretty much equivalent to the HDFS integration test).
> > > > >>
> > > > > >> The cloud deep storages still need some further testing, and some minor
> > > > > >> cleanup still needs to be done for the docs and such. Additionally, we still
> > > > > >> need to figure out how to handle the Kerberos extension, because it extends
> > > > > >> some Hadoop classes so isn't able to use the shaded client jars in a
> > > > > >> straightforward manner, and so still has heavy dependencies and hasn't
> > > > > >> been tested. However, the experiment has started to pan out enough to
> > > > > >> where I think it is worth starting this discussion, because it does have
> > > > > >> some implications.
> > > > >>
> > > > > >> Making this change I think will allow us to update our dependencies with a
> > > > > >> lot more freedom (I'm looking at you, Guava), but the catch is that once we
> > > > > >> make this change and start updating these dependencies, it will become
> > > > > >> hard, nearing impossible, to support Hadoop 2.x, since as far as I know
> > > > > >> there isn't an equivalent set of shaded client jars. I am also not certain
> > > > > >> how far back the Hadoop job classpath isolation stuff goes
> > > > > >> (mapreduce.job.classloader = true), which I think is required to be set on
> > > > > >> Druid tasks for this shaded stuff to work alongside updated Druid
> > > > > >> dependencies.
> > > > >>
> > > > > >> Is anyone opposed to or worried about dropping Hadoop 2.x support after
> > > > > >> the Druid 0.22 release?
> > > > >>
> > > > >
> > > >
> > >
> >
> 
