Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

2022-08-22 Thread Paul Rogers
Gian mentioned MSQ. The new MSQ work is exciting and powerful for Druid 
ingestion. If the data needs cleaning, we would expect users to employ 
something like Spark to do that task, then emit clean data to Kafka or files, 
which Druid MSQ can ingest. That is:

Dirty data -> Spark -> Kafka/Files -> Druid with MSQ

Spark is an industry-standard tool and has a wide set of data engineering 
features developed over many years. Spark is great at data conversion, data 
cleaning, “enrichment” (joins), etc. IMHO, there is no reason for Druid MSQ to 
duplicate these generic Spark features: MSQ is about loading clean data into 
Druid. For users already familiar with Spark, Julian’s Spark connector avoids 
the multi-step path: Spark can do the work directly.
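
To make the MSQ step concrete, here is a minimal sketch of what SQL-based 
ingestion of already-cleaned data could look like. The datasource name, S3 
URI, and columns are all hypothetical, not from this thread:

```sql
-- Hypothetical sketch: ingest cleaned JSON files into a Druid table via MSQ.
INSERT INTO clean_events
SELECT
  TIME_PARSE("timestamp") AS __time,
  "user_id",
  "event_type"
FROM TABLE(
  EXTERN(
    '{"type": "s3", "uris": ["s3://example-bucket/clean/events.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"},
      {"name": "user_id", "type": "string"},
      {"name": "event_type", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```

The point being that MSQ only needs to read and partition data that Spark (or 
another tool) has already cleaned upstream.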

It looks like Spark still supports Hadoop 2. Since Spark has sorted out these 
issues, perhaps, as Samarth suggested, Druid wouldn’t need to, if we had a 
Spark connector.

I recall that one discussion was whether the connector should be part of core 
Druid or some kind of extension. We don’t have a “Druid marketplace” or 
similar mechanism to manage “third-party” extensions, though, and I’m not aware 
that such a feature is under discussion, so having the Spark connector in 
Druid itself may be the only short-term solution. Or is there another option?

Julian,

On the PR thread, I mentioned the work that was done to allow “external tasks” 
such as Spark. That work is waiting for the “new IT” stuff to land so we can 
reasonably write integration tests. My sense is that the external task support 
will reduce some of the more fiddly bits of the Spark PR.

Maytas,

Thanks for offering to review Julian’s PR. We do need a committer to help push 
this PR over the line.

Thanks,

- Paul



> On Aug 8, 2022, at 9:13 PM, Gian Merlino  wrote:
> 
> It's always good to deprecate things for some time prior to removing them,
> so we don't need to (nor should we) remove Hadoop 2 support right now. My
> vote is that in this upcoming release, we should deprecate it. The main
> problem in my eyes is the one Abhishek brought up: the dependency
> management situation with Hadoop 2 is really messy, and I'm not sure
> there's a good way to handle them given the limited classloader isolation.
> This situation becomes tougher to manage with each release, and we haven't
> had people volunteering to find and build comprehensive solutions. It is
> time to move on.
> 
> The concern Samarth raised, that people may end up stuck on older Druid
> versions because they aren't able to upgrade to Hadoop 3, is valid. I can
> see two good solutions to this. First: we can improve native ingest to the
> point where people feel broadly comfortable moving Hadoop 2 workloads to
> native. The work planned as part of doing ingest via multi-stage
> distributed query  is going
> to be useful here, by improving the speed and scalability of native ingest.
> Second: it would also be great to have something similar that runs on
> Spark, for people that have made investments in Spark. I suspect that most
> people that used Hadoop 2 have moved on to Hadoop 3 or Spark, so supporting
> both of those would ease a lot of the potential pain of dropping Hadoop 2
> support.
> 
> On Spark: I'm not familiar with the current state of the Spark work. Is it
> stuck? If so could something be done to unstick it? I agree with Abhishek
> that I wouldn't want to block moving off Hadoop 2 on this. However, it'd be
> great if we could get it done before actually removing Hadoop 2 support
> from the code base.
> 
> 
> On Wed, Aug 3, 2022 at 6:17 AM Abhishek Agarwal 
> wrote:
> 
>> I was thinking that moving from Hadoop 2 to Hadoop 3 will be a
>> lower-resistance path than moving from Hadoop to Spark. Even if we get that
>> PR merged, it will take a good amount of time for Spark integration to
>> reach the same level of maturity as Hadoop or native ingestion. BTW, I am
>> not making an argument against Spark integration; it will certainly be nice
>> to have Spark as an option. Just that Spark integration shouldn't become a
>> blocker for us to get off Hadoop.
>> 
>> BTW, are you using Hadoop 2 right now with the latest Druid version? If so,
>> did you run into errors similar to the ones I posted in my last email?
>> 
>> On Wed, Jul 27, 2022 at 12:02 AM Samarth Jain 
>> wrote:
>> 
>>> I am sure there are other companies out there who are still on Hadoop 2.x
>>> with migration to Hadoop 3.x being a no-go.
>>> If Druid were to drop support for Hadoop 2.x completely, I am afraid it
>>> would prevent users from updating to newer versions of Druid, which
>>> would be a shame.
>>> 
>>> FWIW, we have found in practice, for high-volume use cases, that compaction
>>> based on Druid's Hadoop-based batch ingestion is a lot more scalable than
>>> the native compaction.
>>> 
>>> Having said that, as an alternative, if we can merge Julian's Spark-based
>>> ingestion PR in Druid,
>> 

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

2022-08-22 Thread Maytas Monsereenusorn
Hi Julian,

Thank you so much for your contribution on Spark support. As an existing
committer, I would like to help get the Spark connector merged into OSS
(including PR reviews and any other development work that may be needed).
We can move the conversation regarding Spark support into a new thread, or
reuse the GitHub issue that is already open, to keep this thread on topic with
dropping support for Hadoop 2.x.

Best Regards,
Maytas


Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x for 24.0

2022-08-22 Thread Julian Jaffe
For Spark support, the connector I wrote remains functional, but I haven’t 
updated the PR for six months or so, since it didn’t seem like there was an 
appetite for review. If that’s changing, I could migrate some more recent 
changes back to the OSS PR. Even with an up-to-date patch, though, I see two 
problems:

First, I remain worried that there isn’t sufficient support among committers 
for the Spark connector. I don’t want Druid to end up in the same place it is 
in now for Hadoop 2 support, where no one really maintains the Spark code and 
we wind up with another awkward corner of the code base that holds back other 
development.

Secondly, the PR I have up is for Spark 2.4, which is now two years further out 
of date than it was back in 2020. Similarly to Hadoop, there is a bifurcation in 
the community: Spark 2.4 is still in heavy use, but we might be trading one 
problem for another if we deprecate Hadoop 2 in favor of Spark 2.4. I have 
written a Spark 3.2 connector as well, but it’s been deployed to significantly 
smaller use cases than the 2.4 line.

Even with these two caveats, if there’s a desire among the Druid development 
community to add Spark functionality and support it, I’d love to push this 
across the finish line.

> On Aug 9, 2022, at 1:04 AM, Abhishek Agarwal  
> wrote:
> 
> Yes. We should deprecate it first which is similar to dropping the support
> (no more active development) but we will still ship it for a release or
> two. In a way, we are already in that mode to a certain extent. Many
> features are being built with native ingestion as a first-class citizen.
> E.g. range partitioning is still not supported on Hadoop ingestion. It's
> hard for developers to build and test their business logic for all the
> ingestion modes.
> 
> It will be good to hear what gaps the community sees between native
> ingestion and Hadoop-based batch ingestion, and then work toward fixing
> those gaps before dropping Hadoop ingestion entirely. For example, if
> users want the resource elasticity that a Hadoop cluster gives, we could
> push forward PRs such as https://github.com/apache/druid/pull/10910. It's
> not the same as a Hadoop cluster, but it will nonetheless let users reuse
> their existing infrastructure to run Druid jobs.