Thanks for your input from the Roku user point of view, Krishna. We are
definitely in a tough spot here because Hadoop support prevents us from
dropping Java 11 support, and the domino effect is that we can’t upgrade
off of EOL dependencies such as Jetty 9.

In the Java 11 support discussion,
https://lists.apache.org/thread/bvkztwoyy35mvyqkccp87zrfd68sqqkw, we
discuss the risk of supporting Java 11 beyond Druid 34. I think the biggest
worry is that we are going to get caught in a situation where a patch fix
for a CVE could require dropping Java 11 and Hadoop support in a patch
release because resolving the CVE requires dependency upgrades that don’t
support 11. Delaying dropping support until Druid 36 makes it all the more
likely that we run into that situation.

If we were to drop Hadoop ingest support in October as part of Druid 35,
would there be a clear path forward for your Druid deployments, assuming
the community provides a solid migration plan for open source users
covering Hadoop ingestion alternatives?

Also, if there is a path to supporting Hadoop ingestion as a contrib
extension and someone in the community wanted to carry the torch on its
development, that is definitely a possibility as well. Though I’m not sure
that anyone has scoped out how much work that would be, or if it’s even
possible to achieve.

Thanks,
Lucas

On Wed, Jun 18, 2025 at 6:50 PM Krishna Thirumalasetty <kthir...@gmail.com>
wrote:

> Hi everyone,
>
> Adding to the voices from Netflix and Target — at Roku Inc., we also rely
> heavily on Hadoop-based batch ingestion for a significant portion of our
> Druid datasources. This approach allows us to leverage our existing Hadoop
> infrastructure efficiently and cost-effectively for large-scale batch
> processing.
>
> If the community decides to move forward with the removal of Hadoop
> ingestion support, it would likely force us to remain on an older version
> of Druid for some time. This is not ideal, as it would prevent us from
> benefiting from ongoing improvements, security updates, and newer features
> in the Druid ecosystem.
>
> That said, we fully understand and support the broader goals of modernizing
> the Druid platform, reducing tech debt, and enabling the use of more
> current Java features and dependency upgrades. Given these competing
> priorities, we believe the best path forward would be:
>
>    - *Clear deprecation communication* in Druid 32, discouraging new
>    adoption while giving teams time to react.
>    - *An official target removal date*, such as in Druid 36 (early 2026),
>    which provides adequate lead time for organizations like ours to
>    evaluate alternatives and begin planning migrations.
>    - *Consideration of keeping the Hadoop ingestion module as a contrib
>    extension*, or at least providing a supported migration path with
>    documentation to MM-less ingestion or other batch ingestion
>    alternatives.
>
> This approach would help companies like Roku manage the transition in a
> predictable and structured way, while also empowering the Druid community
> to move forward with more agility.
>
> Thanks for raising this important discussion.
>
> Best,
> Krishna Thirumalasetty
> Roku Inc.
>
> On Tue, Jun 17, 2025 at 3:28 PM Eyal Yurman <eyal.yur...@gmail.com> wrote:
>
> > Sharing as another data point -
> >
> > We still use YARN to run Hadoop-based batch ingestion. It's very useful
> > on-premise for resource sharing, where autoscaling isn't always an option.
> > But we plan to move to Kubernetes for ingestion sometime next year.
> >
> >
> > On Tue, Jun 17, 2025 at 12:20 PM Gian Merlino <g...@apache.org> wrote:
> >
> > > I'm on board with this. I also think we should deprecate it ASAP,
> > > starting in the next major release. It'd be nice to also build a
> > > migration guide that helps people move from Hadoop ingestion to SQL/MSQ
> > > ingestion, and from YARN to K8S pod runners.
> > >
> > > Gian
> > >
> > > On 2025/06/09 20:10:03 Clint Wylie wrote:
> > > > Following up on this, I want to propose the first release of 2026 for
> > > > removal, which I think would be Druid 36, to give some lead time for
> > > > those affected to prepare.
> > > >
> > > > On Wed, Apr 9, 2025 at 8:42 AM Frank Chen <frankc...@apache.org> wrote:
> > > > >
> > > > > We don't use Hadoop ingestion, so it's OK for us to drop support for
> > > > > Hadoop.
> > > > >
> > > > > We can make an announcement to deprecate it first (from 33?), remove it
> > > > > from the official distribution (but keep the ability to build it, as
> > > > > suggested above, from 34?), and remove it completely at a proper time.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Apr 9, 2025 at 5:02 AM Maytas Monsereenusorn <mayt...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > I'm in favor of removing it too, but we should not rush the removal and
> > > > > > should make sure we give users enough time to migrate to other types of
> > > > > > ingestion. Similar to what Lucas said, if Hadoop is holding back Druid,
> > > > > > then we should remove it. Druid also supports many other types of
> > > > > > ingestion compared to back when Hadoop ingestion was added.
> > > > > > For Netflix, we will be migrating to MM-less Druid ingestion in K8s. I
> > > > > > think MM-less Druid ingestion in K8s is probably the closest to Hadoop
> > > > > > ingestion, as we do not have to maintain a dedicated Druid-specific MM
> > > > > > cluster (it works well for companies with existing large/shared compute
> > > > > > clusters). Personally, I feel we should focus our energy on things like
> > > > > > MM-less Druid in K8s (which is still marked as experimental) rather than
> > > > > > Hadoop.
> > > > > >
> > > > > > Best Regards,
> > > > > > Maytas
> > > > > >
> > > > > > On Tue, Apr 8, 2025 at 4:06 AM Lucas Capistrant <
> > > > > > capistrant.lu...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Yes, I’m in favor of removing it from the core release, and also in
> > > > > > > favor of officially announcing deprecation with a timeline for removal,
> > > > > > > if we have not yet. It stinks to lose the Hadoop ingest support, but if
> > > > > > > that project is going to hold back Druid, it seems we don’t have much
> > > > > > > choice.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Lucas
> > > > > > >
> > > > > > > On Tue, Apr 8, 2025 at 4:27 AM Karan Kumar <ka...@apache.org> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > I like the plan of having a Hadoop profile, not shipping it as part of
> > > > > > > > the Apache release, and then eventually removing it in a release or two.
> > > > > > > > Does that work for you folks, Maytas and Lucas?
> > > > > > > >
> > > > > > > > On Mon, Apr 7, 2025 at 3:59 PM Zoltan Haindrich <k...@rxd.hu> wrote:
> > > > > > > >
> > > > > > > >> Hey,
> > > > > > > >>
> > > > > > > >> I was also bumping into this while running dependency checks for
> > > > > > > >> Druid-33:
> > > > > > > >> * I encountered a CVE [1] in hadoop-runtime-3.3.6, which is a shaded jar
> > > > > > > >> * we have a PR to upgrade to 3.4.0; so I also checked 3.4.1 - but they
> > > > > > > >> are also affected, as they ship Jetty 9.4.53.v20231009 [2]
> > > > > > > >>
> > > > > > > >> ...so right now there is no normal way to solve this - the fact that
> > > > > > > >> it's a shaded jar further complicates things...
> > > > > > > >>
> > > > > > > >> Note: the trunk Hadoop uses jetty 9.4.57 [3] - which is good; so there
> > > > > > > >> will be some future version which might not be affected.
> > > > > > > >> I wanted to be thorough and dug into a few things - to see how soon an
> > > > > > > >> updated version may come out:
> > > > > > > >> * there are 300+ tickets targeted for 3.5.0... so that doesn't look
> > > > > > > >> promising
> > > > > > > >> * but even for 3.4.2 there is a huge jira [4] with 159 subtasks, out of
> > > > > > > >> which 123 are unassigned...
> > > > > > > >>    if that's really needed for 3.4.2 then I doubt they'll be rolling out
> > > > > > > >>    a release soon...
> > > > > > > >> * I was also peeking into the jdk17 jiras, which will most likely arrive
> > > > > > > >> in 3.5.0 [5]
> > > > > > > >>
> > > > > > > >> Keeping Hadoop like this:
> > > > > > > >> * holds us back from upgrading 3rd party deps
> > > > > > > >> * forces us to add security suppressions
> > > > > > > >> * slows down newer jdk adoption - as officially hadoop only supports 11
> > > > > > > >>
> > > > > > > >> I think most of the companies using Hadoop are utilizing binaries which
> > > > > > > >> are built from forks - and they also have the ability & bandwidth to fix
> > > > > > > >> these 3rd party libraries...
> > > > > > > >> I would also guess that they might be using a custom-built Druid as well
> > > > > > > >> - and as a result they have more control over what kind of features they
> > > > > > > >> have or not.
> > > > > > > >>
> > > > > > > >> So I was wondering about the following:
> > > > > > > >> * add a maven profile for hadoop support (defaults to off; a rough
> > > > > > > >> sketch of what I mean is below)
> > > > > > > >> * retain compatibility: during CI runs, build with jdk11 and run all
> > > > > > > >> hadoop tests
> > > > > > > >> * future releases (>=34) would ship w/o hadoop ingestion
> > > > > > > >> * companies using hadoop-ingestion could turn on the profile and use it
> > > > > > > >>
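> > > > > > > >> As a minimal sketch of the idea (assuming a placeholder profile id and
> > > > > > > >> module name for illustration - this is not the actual Druid pom layout),
> > > > > > > >> an opt-in profile in the root pom.xml could look roughly like:
> > > > > > > >>
> > > > > > > >>   <!-- hypothetical opt-in profile; the module name is a placeholder -->
> > > > > > > >>   <profiles>
> > > > > > > >>     <profile>
> > > > > > > >>       <id>hadoop-ingestion</id>
> > > > > > > >>       <!-- not active by default; enable with -Phadoop-ingestion -->
> > > > > > > >>       <activation>
> > > > > > > >>         <activeByDefault>false</activeByDefault>
> > > > > > > >>       </activation>
> > > > > > > >>       <modules>
> > > > > > > >>         <module>indexing-hadoop</module>
> > > > > > > >>       </modules>
> > > > > > > >>     </profile>
> > > > > > > >>   </profiles>
> > > > > > > >>
> > > > > > > >> Companies needing Hadoop ingestion would then build with
> > > > > > > >> -Phadoop-ingestion, while the default build and convenience binaries
> > > > > > > >> would ship without it.
> > > > > > > >>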
> > > > > > > >> What do you guys think?
> > > > > > > >>
> > > > > > > >> cheers,
> > > > > > > >> Zoltan
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> [1] https://nvd.nist.gov/vuln/detail/cve-2024-22201
> > > > > > > >> [2] https://github.com/apache/hadoop/blob/626b227094027ed08883af97a0734d2db7863864/hadoop-project/pom.xml#L40
> > > > > > > >> [3] https://github.com/apache/hadoop/blob/3d2f4d669edcf321509ceacde58a8160aef06a8c/hadoop-project/pom.xml#L40
> > > > > > > >> [4] https://issues.apache.org/jira/browse/HADOOP-19353
> > > > > > > >> [5] https://issues.apache.org/jira/browse/HADOOP-17177
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On 1/8/25 11:56, Abhishek Agarwal wrote:
> > > > > > > >> > @Adarsh - FYI since you are the release manager for 32.
> > > > > > > >> >
> > > > > > > >> > On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <abhis...@apache.org>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> >> I don't want to kick that can too far down the road either :) We don't
> > > > > > > >> >> want to give a false hope that it's going to remain around forever. But
> > > > > > > >> >> yes, let's deprecate both Hadoop and Java 11 support in the upcoming 32
> > > > > > > >> >> release. It's unfortunate that Hadoop still doesn't support Java 17. We
> > > > > > > >> >> shouldn't let it hold us back. Jetty and pac4j are dropping Java 11
> > > > > > > >> >> support, and we would want to upgrade to newer versions of these
> > > > > > > >> >> dependencies soon. There are also nice language features in Java 17,
> > > > > > > >> >> such as pattern matching, multiline strings, and a lot more, that we
> > > > > > > >> >> can't use if we have to be compile compatible with Java 11. If you need
> > > > > > > >> >> the resource elasticity that Hadoop provides or want to reuse shared
> > > > > > > >> >> infrastructure in the company, MM-less ingestion is a good alternative.
> > > > > > > >> >>
> > > > > > > >> >> So let's deprecate it in 32. We can decide on removal later, but
> > > > > > > >> >> hopefully it doesn't take too many releases to do that.
> > > > > > > >> >>
> > > > > > > >> >> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <ka...@apache.org> wrote:
> > > > > > > >> >>
> > > > > > > >> >>> Okay, from what I can gather, a few folks still need Hadoop ingestion.
> > > > > > > >> >>> So let's kick the can down the road regarding removal of that support,
> > > > > > > >> >>> but let's agree on the deprecation plan. Since Druid 32 is around the
> > > > > > > >> >>> corner, let's at least deprecate Hadoop ingestion so that any new users
> > > > > > > >> >>> are not onboarded to this way of ingestion. Deprecation also becomes a
> > > > > > > >> >>> forcing function in internal company channels for prioritizing getting
> > > > > > > >> >>> off Hadoop.
> > > > > > > >> >>>
> > > > > > > >> >>> How does this plan look?
> > > > > > > >> >>>
> > > > > > > >> >>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn <mayt...@apache.org>
> > > > > > > >> >>> wrote:
> > > > > > > >> >>>
> > > > > > > >> >>>> We at Netflix are in a similar situation to Target Corporation (Lucas
> > > > > > > >> >>>> C's email above).
> > > > > > > >> >>>> We currently rely on Hadoop ingestion for all our batch ingestion jobs.
> > > > > > > >> >>>> The main reason for this is that we already have a large Hadoop cluster
> > > > > > > >> >>>> supporting our Spark workloads that we can leverage for Druid ingestion.
> > > > > > > >> >>>> I imagine that the closest alternative for us would be moving to K8s /
> > > > > > > >> >>>> MiddleManager-less ingestion jobs.
> > > > > > > >> >>>>
> > > > > > > >> >>>> On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant <
> > > > > > > >> >>>> capistrant.lu...@gmail.com> wrote:
> > > > > > > >> >>>>
> > > > > > > >> >>>>> Apologies for the empty email… fat fingers.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> Just wanted to say that we at Target Corporation (USA) still rely
> > > > > > > >> >>>>> heavily on Hadoop ingest. We’d selfishly want support forever, but if
> > > > > > > >> >>>>> forced to pivot to a new ingestion style for our larger batch ingest
> > > > > > > >> >>>>> jobs that currently leverage the cheap compute on YARN, the longer the
> > > > > > > >> >>>>> lead time between the community’s announcement and the actual release
> > > > > > > >> >>>>> with no support, the better. Making these types of changes can be a
> > > > > > > >> >>>>> slow process for a slow-to-maneuver corporate cruise ship.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <
> > > > > > > >> >>>>> capistrant.lu...@gmail.com>
> > > > > > > >> >>>>> wrote:
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org> wrote:
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>> +1 for removal of Hadoop based ingestion. It's a maintenance overhead
> > > > > > > >> >>>>>>> and stops us from moving to Java 17.
> > > > > > > >> >>>>>>> I am not aware of any gaps in SQL based ingestion that limit users
> > > > > > > >> >>>>>>> moving off from Hadoop. If there are any, please feel free to reach
> > > > > > > >> >>>>>>> out via Slack/GitHub.
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org> wrote:
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>>> Hey everyone,
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> It is about that time again to take a pulse on how commonly Hadoop
> > > > > > > >> >>>>>>>> based ingestion is used with Druid in order to determine if we should
> > > > > > > >> >>>>>>>> keep supporting it or not going forward.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> In my view, Hadoop based ingestion has unofficially been on life
> > > > > > > >> >>>>>>>> support for quite some time, as we do not really go out of our way to
> > > > > > > >> >>>>>>>> add new features to it, and we perform very minimal testing to ensure
> > > > > > > >> >>>>>>>> everything keeps working. The most recent changes to it that I am
> > > > > > > >> >>>>>>>> aware of were to bump versions and require Hadoop 3, but that was
> > > > > > > >> >>>>>>>> primarily motivated by the selfish reason of wanting to use its
> > > > > > > >> >>>>>>>> contained client library and better isolation so that we could free
> > > > > > > >> >>>>>>>> up our own dependencies to be updated. This thread is motivated by a
> > > > > > > >> >>>>>>>> similar reason, I guess: see the other thread I started recently
> > > > > > > >> >>>>>>>> discussing dropping support for Java 11, where Hadoop does not yet
> > > > > > > >> >>>>>>>> support the Java 17 runtime, and so the outcome of this discussion is
> > > > > > > >> >>>>>>>> involved in those plans.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> I think SQL based ingestion with the multi-stage query engine is the
> > > > > > > >> >>>>>>>> future of batch ingestion, and the Kubernetes based task runner
> > > > > > > >> >>>>>>>> provides an alternative for task auto scaling capabilities. Because of
> > > > > > > >> >>>>>>>> this, I don't personally see a lot of compelling reasons to keep
> > > > > > > >> >>>>>>>> supporting Hadoop, so I would be in favor of just dropping support for
> > > > > > > >> >>>>>>>> it completely, though I see no harm in keeping HDFS deep storage
> > > > > > > >> >>>>>>>> around. In past discussions I think we had tied Hadoop removal to
> > > > > > > >> >>>>>>>> adding something like Spark to replace it, but I wonder if this still
> > > > > > > >> >>>>>>>> needs to be the case.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> I do know that, in previous dev list discussions about this topic,
> > > > > > > >> >>>>>>>> there have classically been quite a lot of large Druid clusters in the
> > > > > > > >> >>>>>>>> wild still relying on Hadoop, so I wanted to check to see if this is
> > > > > > > >> >>>>>>>> still true and, if so, if any of these clusters have plans to
> > > > > > > >> >>>>>>>> transition to newer ways of ingesting data like SQL based ingestion.
> > > > > > > >> >>>>>>>> While from a dev/maintenance perspective it would be best to just drop
> > > > > > > >> >>>>>>>> it completely, if there is still a large user base I think we need to
> > > > > > > >> >>>>>>>> be open to keeping it around for a while longer. If we do need to keep
> > > > > > > >> >>>>>>>> it, maybe it would be worth it to invest some time in moving it into a
> > > > > > > >> >>>>>>>> contrib extension so that it isn't bundled by default with Druid
> > > > > > > >> >>>>>>>> releases, to discourage new adoption and more accurately reflect its
> > > > > > > >> >>>>>>>> current status in Druid.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>
> > > > > > >
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>
> > > > > > > >> >>>
> > > > > > > >> >>
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > For additional commands, e-mail: dev-h...@druid.apache.org
> > >
> > >
> >
> > --
> >
> > Best regards,
> > Eyal Yurman
> >
>
