Hi all,

I'd like to revive this thread a bit. What's everyone's latest thinking on getting the Spark connectors working? I'd like to see if we can coordinate a proposal together.

Regards,
Will
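To make the proposal concrete, here is a straw-man of the user-facing experience we could aim for, using Spark's generic DataFrame read/write API in Java. This is a sketch only: the "druid" format short name and every option key shown are hypothetical placeholders, not an existing API.

```java
// Hypothetical usage sketch -- no "druid" Spark format exists yet.
// Reading a Druid datasource into a DataFrame:
Dataset<Row> events = spark.read()
    .format("druid")                                      // placeholder short name
    .option("druid.broker.url", "http://broker:8082")     // placeholder option
    .option("table", "wikipedia")                         // placeholder option
    .load();

// Writing a DataFrame to a Druid datasource:
events.write()
    .format("druid")
    .option("druid.overlord.url", "http://overlord:8090") // placeholder option
    .option("table", "wikipedia_copy")                    // placeholder option
    .mode("append")
    .save();
```

Something in this shape would cover both of the workflows discussed below (a Spark job targeting a Druid cluster for writes, and reading Druid data for joins/export) without the user touching connector internals.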
On Thu, Aug 10, 2023 at 3:05 AM Maytas Monsereenusorn <mayt...@gmail.com> wrote:
> Hi all,
>
> First of all, thank you Julian for bringing this up and starting the conversation. Just to chime in on our (Netflix) use cases: we use Spark 3.3 and would benefit from both the reader and the writer.
>
> For the writer, we currently have a Spark job that writes data to an intermediate Iceberg table, and we then separately issue a Druid batch ingestion to consume from that intermediate table (by passing in the S3 paths of the table). Write support from within a Spark job (to Druid) would let us eliminate the intermediate Iceberg table and simplify our workflow (possibly also reducing our storage and compute cost). To answer your question, this is more aligned with a Spark job targeting a Druid cluster.
>
> For the reader, we would like to be able to export data from Druid (such as moving Druid data into an Iceberg table) and also to join or further process Druid data with other (non-Druid) data, such as other Iceberg tables, within Spark jobs. To answer your question, this is more aligned with the reader in a Spark job reading Druid segment files directly.
>
> Thanks,
> Maytas
>
> On Wed, Aug 9, 2023 at 2:14 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
> > Will, Julian,
> >
> > See responses below tagged with [Rajiv]:
> >
> > From: Will Xu <will...@imply.io.INVALID>
> > Date: Tuesday, August 8, 2023 at 9:27 AM
> > To: dev@druid.apache.org <dev@druid.apache.org>
> > Subject: Re: Spark Druid connectors, take 2
> >
> > For which version to target, I think we should survey the Druid community and get input. In your case, which version are you currently deploying? Historical experience tells me we should target current and current-1 (3.4.x and 3.3.x).
> >
> > [Rajiv] Version should be fine at least for our use cases.
> > In terms of the writer (Spark writes to Druid), what's the user workflow you envision? Would the user trigger a Spark job from Druid, or would the user submit a Spark job that targets a Druid cluster? The former would allow other systems, compaction for example, to use Spark as a runner.
> >
> > [Rajiv] For us it is the latter, where a Spark job targets a Druid cluster.
> >
> > In terms of the reader (Spark reads Druid), I'm most curious to find out what experience you are imagining. Should the reader read Druid segment files directly, or should it issue queries to Druid (maybe even to historicals?) so that the query can be parallelized?
> >
> > [Rajiv] Segments are going to be tricky, especially with things like compaction. I think we definitely need to be able to query the hot cache as well, so not just segments/historicals.
> >
> > Of the two, there is a lot more interest in the writer from the people I've been talking to.
> >
> > [Rajiv] We need both read and write for the different kinds of jobs.
> >
> > Responses to Julian's asks are in-line below.
> >
> > Regards,
> > Will
> >
> > On Tue, Aug 8, 2023 at 8:50 AM Julian Jaffe <julianfja...@gmail.com> wrote:
> > > Hey all,
> > >
> > > There was talk earlier this year about resurrecting the effort to add direct Spark readers and writers to Druid. Rather than repeat the previous attempt and parachute in with updated connectors, I'd like to start by building a little more consensus around what the Druid dev community wants as potential maintainers.
> > >
> > > To begin with, I want to solicit opinions on two topics:
> > >
> > > Should these connectors be written in Scala or Java? The benefits of Scala are that the existing connectors are written in Scala, as are most open-source references for Spark DataSource V2 implementations.
> > > The benefits of Java are that Druid is written in Java, so engineers interested in contributing to Druid wouldn't need to switch between languages. Additionally, existing tooling, static checkers, etc. could be used with minimal effort, keeping code style and developer ergonomics consistent across Druid instead of maintaining an alternate Scala toolchain in sync.
> >
> > [Rajiv] We need Java support.
> >
> > > Which Spark version should this effort target? The most recently released version of Spark is 3.4.1. Should we aim to integrate with the latest Spark minor version under the assumption that this will give us the longest window of support, or should we build against an older minor line (3.3? 3.2?) since most Spark users tend to lag? For reference, there are currently three stable Spark release versions: 3.2.4, 3.3.2, and 3.4.1. From a user's point of view, the API is mostly compatible across a major version (i.e. 3.x), while developer APIs such as the ones we would use to build these connectors can change between minor versions.
> > >
> > > There are quite a few nuances and trade-offs inherent to the decisions above, and my hope is that by hashing these choices out before presenting an implementation, we can build buy-in from the Druid maintainer community that will result in this effort succeeding where the first effort failed.
> >
> > [Rajiv] 3.4 (and above) will work for us.
> >
> > Thanks,
> > Rajiv
> >
> > > Thanks,
> > > Julian
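As a reference point for the Scala-vs-Java question quoted above: a DataSource V2 connector's entry point is small enough to sketch in Java. The `TableProvider` and `Table` interfaces below are Spark's actual 3.x connector API; the `Druid*` class names and the schema-lookup helper are hypothetical placeholders for illustration only.

```java
import java.util.Map;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Hypothetical sketch: the entry point Spark would discover for a "druid" format.
public class DruidTableProvider implements TableProvider {

  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    // A real connector would ask the Druid broker/coordinator for the
    // datasource's schema (e.g. via a segment metadata query).
    return DruidSchemaResolver.resolve(options.get("table")); // hypothetical helper
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning,
                        Map<String, String> properties) {
    // DruidTable (hypothetical) would implement SupportsRead and/or SupportsWrite
    // to hand Spark a ScanBuilder (reader) or WriteBuilder (writer).
    return new DruidTable(schema, properties);
  }
}
```

Whichever language is chosen, the surface area to implement is roughly this provider plus the read/write builder classes behind it, which keeps the Scala-vs-Java decision mostly a question of tooling and contributor ergonomics rather than code volume.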