I am sure there are other companies out there who are still on Hadoop 2.x, with migration to Hadoop 3.x being a no-go. If Druid were to drop support for Hadoop 2.x completely, I am afraid it would prevent those users from updating to newer versions of Druid, which would be a shame.
FWIW, we have found in practice for high-volume use cases that compaction based on Druid's Hadoop-based batch ingestion is a lot more scalable than the native compaction. Having said that, as an alternative, if we can merge Julian's Spark-based ingestion PR <https://github.com/apache/druid/issues/9780> in Druid, that might provide an alternate way for users to get rid of the Hadoop dependency.

On Tue, Jul 26, 2022 at 3:19 AM Abhishek Agarwal <abhishek.agar...@imply.io> wrote:

Reviving this conversation again. @Will - Do you still have concerns about HDFS stability? Hadoop 3 has been around for some time now and is very stable as far as I know.

The dependencies coming from Hadoop 2 are also old enough that they cause dependency scans to fail. E.g. the Log4j 1.x dependencies that come from Hadoop 2 get flagged during these scans. We have also seen issues when customers try to use Hadoop ingestion with the latest Log4j 2 library:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.log4j.helpers.OptionConverter.convertLevel(Ljava/lang/String;Lorg/apache/logging/log4j/Level;)Lorg/apache/logging/log4j/Level;
        at org.apache.log4j.config.PropertiesConfiguration.parseLogger(PropertiesConfiguration.java:393)
        at org.apache.log4j.config.PropertiesConfiguration.configureRoot(PropertiesConfiguration.java:326)
        at org.apache.log4j.config.PropertiesConfiguration.doConfigure(PropertiesConfiguration.java:303)

Instead of fixing these point issues, we would be better served by moving to Hadoop 3 entirely. Hadoop 3 gets more frequent releases and its dependencies are well isolated.

On Tue, Oct 12, 2021 at 12:05 PM Karan Kumar <karankumar1...@gmail.com> wrote:

Hello,
We can also use Maven profiles. We keep Hadoop 2 support by default and add a new Maven profile for Hadoop 3. This will allow the user to choose the profile that is best suited for their use case. Agreed, it will not help with the Hadoop dependency problems, but it does enable our users to use Druid with multiple flavors. Also, with Hadoop 3, as Clint mentioned, the dependencies come pre-shaded, so we significantly reduce our effort in solving the dependency problems. I have the PR in its last phases, where I am able to run the entire test suite (unit + integration tests) on both the default (Hadoop 2) profile and the new Hadoop 3 profile.
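For reference, a minimal sketch of what such a profile split could look like in the parent pom.xml. The profile ids, property name, and version numbers here are illustrative assumptions, not taken from the actual PR:

    <profiles>
      <profile>
        <!-- default build keeps today's Hadoop 2 behavior -->
        <id>hadoop2</id>
        <activation>
          <activeByDefault>true</activeByDefault>
        </activation>
        <properties>
          <hadoop.compile.version>2.8.5</hadoop.compile.version>
        </properties>
      </profile>
      <profile>
        <!-- opt-in build against Hadoop 3 -->
        <id>hadoop3</id>
        <properties>
          <hadoop.compile.version>3.3.1</hadoop.compile.version>
        </properties>
      </profile>
    </profiles>

A user would then select the Hadoop 3 flavor with "mvn package -Phadoop3", while a plain "mvn package" keeps the current default.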
On 2021/06/09 11:55:31, Will Lauer <wla...@verizonmedia.com.INVALID> wrote:

Clint,

I fully understand what type of headache dealing with these dependency issues is. We deal with this all the time, and based on conversations I've had with our internal Hadoop development team, they are quite aware of these issues and just as frustrated by them as you are. I'm certainly in favor of doing something to improve this situation, as long as it doesn't abandon a large section of the user base, which I think dropping Hadoop 2 would do.

I think there are solutions that can help with the conflicting dependency problem. Refactoring Hadoop support into an independent extension is certainly a start, but I think the dependency problem is bigger than that. There are always going to be conflicts between dependencies in the core system and in extensions as the system gets bigger. We have one right now internally that prevents us from enabling SQL in our instance of Druid, due to a conflict between the version of protobuf used by Calcite and the one used by one of our critical extensions. Long term, I think you are going to need to carefully think through a ClassLoader-based strategy to truly separate the impact of various dependencies.

While I'm not seriously suggesting it for Druid, OSGi WOULD solve this problem. It's a system that allows you to explicitly declare what each bundle exposes to the system and what each bundle consumes from it, allowing multiple conflicting dependencies to co-exist without impacting each other. OSGi is the big-hammer approach, but I bet a more appropriate solution would be a simpler custom-ClassLoader-based solution that hid all dependencies in extensions, keeping them from impacting the core, and that only exposed "public" pieces of the core to extensions. If Druid's core could be extended without impacting the various extensions, and the extensions' dependencies could be modified without impacting the core, this would go a long way towards solving the problem you have described.

Will

Will Lauer
Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering
M 508 561 6427
1908 S. First St, Champaign, IL 61822
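To make the custom-ClassLoader idea concrete, here is a rough sketch of the kind of "parent-last" loader being described: extension classes and their dependencies resolve child-first, so they cannot clash with the core's versions, while a whitelist of public API packages always delegates to the core so both sides share those types. The package names are illustrative assumptions, not Druid's actual API surface:

    import java.net.URL;
    import java.net.URLClassLoader;

    public class ExtensionClassLoader extends URLClassLoader {
        // Packages that must always come from the core (the "public" surface).
        // These names are hypothetical placeholders.
        private static final String[] SHARED_PACKAGES = {
            "org.apache.druid.guice.",
            "org.apache.druid.initialization."
        };

        public ExtensionClassLoader(URL[] extensionJars, ClassLoader coreLoader) {
            super(extensionJars, coreLoader);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    if (isShared(name)) {
                        // shared API types: parent-first, so core and extension agree
                        c = getParent().loadClass(name);
                    } else {
                        try {
                            // everything else: child-first, hiding extension deps from the core
                            c = findClass(name);
                        } catch (ClassNotFoundException e) {
                            // not bundled with the extension, fall back to the core
                            c = getParent().loadClass(name);
                        }
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        private static boolean isShared(String name) {
            for (String prefix : SHARED_PACKAGES) {
                if (name.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }
    }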
On Wed, Jun 9, 2021 at 12:47 AM Clint Wylie <cwy...@apache.org> wrote:

@itai, pending the outcome of this discussion, I think it makes sense to have a wider community thread to announce any decisions we make here. Thanks for bringing that up.

@rajiv, Minio support seems unrelated to this discussion. It seems like a reasonable request, but I recommend starting another thread to see if someone is interested in taking up this effort.

@jihoon, I definitely agree that Hadoop should be refactored into an extension longer term. I don't think this upgrade would necessarily make such a refactor any easier, but not harder either. Unfortunately, just moving Hadoop to an extension doesn't really do anything to help our dependency problem, which is the thing that has agitated me enough to start this thread and start looking into solutions.

@will/@frank, I feel like the stranglehold Hadoop has on our dependencies has become especially painful in the last couple of years. Most painful to me is that we are stuck using a version of Apache Calcite from 2019 (six versions behind the latest), because newer versions require a newer version of Guava. This means we cannot get any bug fixes and improvements in our SQL parsing layer without doing something like packaging a shaded version of it ourselves or solving our Hadoop dependency problem.

Many other dependencies have also proven problematic with Hadoop in the past, and since we aren't able to run the Hadoop integration tests in Travis, there is always the chance that we don't catch these issues when they go in. Now that we have turned on dependabot this week (https://github.com/apache/druid/pull/11079), I imagine we are going to have to proceed very carefully with it until we are able to resolve this dependency issue.

Hadoop 3.3.0 is also the first release to support running on a Java version newer than Java 8, per https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions, which is another area we have been working towards: having Druid officially support Java 11+ environments.

I'm sort of at a loss for what else to do besides one of:
- switching to these Hadoop 3 shaded jars and dropping 2.x support
- figuring out how to custom-package our own Hadoop 2.x dependencies, shaded similarly to the Hadoop 3 client jars, and only supporting Hadoop with application classpath isolation (mapreduce.job.classloader = true)
- just dropping support for Hadoop completely

I would much rather devote all effort to making Druid's native batch ingestion better, to encourage people to migrate to that, than continue fighting to keep supporting Hadoop, so upgrading and switching to the shaded client jars at least seemed like a reasonable compromise compared to dropping it completely. Maybe making custom shaded Hadoop dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I imagine, but it does seem like the most work among the solutions I could think of.

Does anyone have any other ideas for how we can isolate our dependencies from Hadoop? Solutions like shading Guava (https://github.com/apache/druid/pull/10964) would let Druid itself use newer Guava, but that doesn't help with conflicts within our dependencies, which has always seemed like the larger problem to me. Moving Hadoop support to an extension doesn't help anything unless we can ensure that we can run Druid ingestion tasks on Hadoop without having to match all of the Hadoop cluster's dependencies with some sort of classloader wizardry.
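As a concrete reference for the classpath-isolation option mentioned above, enabling it on a Druid Hadoop ingestion task goes through the jobProperties of the task's tuningConfig, roughly like this (fragment only; the surrounding index_hadoop spec is omitted):

    "tuningConfig" : {
      "type" : "hadoop",
      "jobProperties" : {
        "mapreduce.job.classloader" : "true"
      }
    }

With this set, the MapReduce job runs Druid's code in a separate classloader from the Hadoop cluster's own classpath, which is what would let the shaded client jars coexist with updated Druid dependencies.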
Maybe we could consider keeping a 0.22.x release line in Druid that gets security and minor bug fixes for some period of time, to give people a longer window to migrate off of Hadoop 2.x? I can't speak for the rest of the committers, but I would personally be more open to maintaining such a branch if it meant that, moving forward, we could at least update all of our dependencies to newer versions, while still providing a transition path until users migrate to Hadoop 3 or native Druid batch ingestion.

Any other ideas?

On Tue, Jun 8, 2021 at 7:44 PM frank chen <frankc...@apache.org> wrote:

Considering Druid takes advantage of lots of external components to work, I think we should upgrade Druid in a somewhat conservative way. Dropping support for Hadoop 2 is not a good idea. The upgrade of the ZooKeeper client in Druid is also preventing me from adopting 0.22 for a longer time.

Although users could upgrade these dependencies first in order to use the latest Druid releases, frankly speaking, these upgrades are not so easy in production and usually take a long time, which would prevent users from experiencing new features of Druid. As for Hadoop 3, I have heard of some performance issues, which also leaves me without the confidence to upgrade.

I think what Jihoon proposes is a good idea: separating Hadoop 2 from the Druid core as an extension. Since Hadoop 2 has not yet reached EOL, to achieve a balance between compatibility and long-term evolution, maybe we could provide two extensions: one for Hadoop 2 and one for Hadoop 3.
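If the two-extension approach were taken, picking a flavor would presumably come down to the usual extension loading mechanism in runtime.properties, along these lines (the Hadoop extension names here are hypothetical, sketching the proposal rather than anything that exists today):

    # load exactly one of the hypothetical Hadoop flavors alongside deep storage
    druid.extensions.loadList=["druid-hdfs-storage", "druid-hadoop3-extensions"]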
On Wed, Jun 9, 2021 at 4:13 AM, Will Lauer <wla...@verizonmedia.com.invalid> wrote:

Just to follow up on this: our main problem with Hadoop 3 right now has been instability in HDFS, to the extent that we put on hold any plans to deploy it to our production systems. I would claim Hadoop 3 isn't mature enough yet to consider migrating Druid to it.

Will

On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <wla...@verizonmedia.com> wrote:

Unfortunately, the migration off of Hadoop 2 is a hard one (maybe not for Druid, but certainly for big organizations running large Hadoop 2 workloads). If Druid migrated to Hadoop 3 after 0.22, that would probably prevent me from taking any new versions of Druid for at least the remainder of the year, and possibly longer.

Will
On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cwy...@apache.org> wrote:

Hi all,

I've been assisting with some experiments to see how we might want to migrate Druid to support Hadoop 3.x and, more importantly, to see if we can finally be free of some of the dependency issues it has been causing for as long as I can remember working with Druid.

Hadoop 3 introduced shaded client jars (https://issues.apache.org/jira/browse/HADOOP-11804), with the purpose of allowing applications to talk to the Hadoop cluster without drowning in its transitive dependencies. The experimental branch that I have been helping with, which uses these new shaded client jars, can be seen in this PR: https://github.com/apache/druid/pull/11314. It is currently working with the HDFS integration tests as well as the Hadoop tutorial flow in the Druid docs (which is pretty much equivalent to the HDFS integration test).

The cloud deep storages still need some further testing, and some minor cleanup still needs to be done for the docs and such.
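For context, the shaded client approach from HADOOP-11804 means depending on just two umbrella artifacts instead of the sprawling hadoop-client dependency tree, roughly as follows (the version shown is illustrative):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>3.3.1</version>
    </dependency>
    <dependency>
      <!-- shaded transitive dependencies, needed only at runtime -->
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-runtime</artifactId>
      <version>3.3.1</version>
      <scope>runtime</scope>
    </dependency>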
Additionally, we still need to figure out how to handle the Kerberos extension: because it extends some Hadoop classes, it isn't able to use the shaded client jars in a straightforward manner, so it still has heavy dependencies and hasn't been tested. However, the experiment has started to pan out enough that I think it is worth starting this discussion, because it does have some implications.

Making this change I think will allow us to update our dependencies with a lot more freedom (I'm looking at you, Guava), but the catch is that once we make this change and start updating these dependencies, it will become hard, nearing impossible, to support Hadoop 2.x, since as far as I know there isn't an equivalent set of shaded client jars for it. I am also not certain how far back the Hadoop job classpath isolation support goes (mapreduce.job.classloader = true), which I think is required to be set on Druid tasks for this shaded stuff to work alongside updated Druid dependencies.

Is anyone opposed to or worried about dropping Hadoop 2.x support after the Druid 0.22 release?