Hi all,

I'd like to revive this thread a bit. What's everyone's latest thinking on getting the Spark connectors working? I'd like to see if we can coordinate a proposal together.

Regards,
Will
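To make the proposal concrete, here is a straw-man of the user-facing experience we could aim for, using Spark's generic DataFrame read/write API in Java. This is a sketch only: the "druid" format short name and every option key shown are hypothetical placeholders, not an existing API.

```java
// Hypothetical usage sketch -- no "druid" Spark format exists yet.
// Reading a Druid datasource into a DataFrame:
Dataset<Row> events = spark.read()
    .format("druid")                                      // placeholder short name
    .option("druid.broker.url", "http://broker:8082")     // placeholder option
    .option("table", "wikipedia")                         // placeholder option
    .load();

// Writing a DataFrame to a Druid datasource:
events.write()
    .format("druid")
    .option("druid.overlord.url", "http://overlord:8090") // placeholder option
    .option("table", "wikipedia_copy")                    // placeholder option
    .mode("append")
    .save();
```

Something in this shape would cover both of the workflows discussed below (a Spark job targeting a Druid cluster for writes, and reading Druid data for joins/export) without the user touching connector internals.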
On Thu, Aug 10, 2023 at 3:05 AM Maytas Monsereenusorn <mayt...@gmail.com> wrote:
> Hi all,
>
> First of all, thank you Julian for bringing this up and starting the conversation. Just to chime in on our (Netflix) use cases: we use Spark 3.3 and would benefit from both the reader and the writer.
>
> For the writer, we currently have a Spark job that writes data to an intermediate Iceberg table, and we then separately issue a Druid batch ingestion to consume from that intermediate table (by passing in the S3 paths of the table). Write support from within a Spark job (to Druid) would let us eliminate the intermediate Iceberg table and simplify our workflow (possibly also reducing our storage and compute cost). To answer your question, this is more aligned with a Spark job targeting a Druid cluster.
>
> For the reader, we would like to be able to export data from Druid (such as moving Druid data into an Iceberg table) and also to join or further process Druid data with other (non-Druid) data, such as other Iceberg tables, within Spark jobs. To answer your question, this is more aligned with the reader in a Spark job reading Druid segment files directly.
>
> Thanks,
> Maytas
>
> On Wed, Aug 9, 2023 at 2:14 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
> > Will, Julian,
> >
> > See responses below tagged with [Rajiv]:
> >
> > From: Will Xu <will...@imply.io.INVALID>
> > Date: Tuesday, August 8, 2023 at 9:27 AM
> > To: dev@druid.apache.org <dev@druid.apache.org>
> > Subject: Re: Spark Druid connectors, take 2
> >
> > For which version to target, I think we should survey the Druid community and get input. In your case, which version are you currently deploying? Historical experience tells me we should target current and current-1 (3.4.x and 3.3.x).
> >
> > [Rajiv] Version should be fine at least for our use cases.
> > In terms of the writer (Spark writes to Druid), what's the user workflow you envision? Would the user trigger a Spark job from Druid, or would the user submit a Spark job that targets a Druid cluster? The former would allow other systems, compaction for example, to use Spark as a runner.
> >
> > [Rajiv] For us it is the latter, where a Spark job targets a Druid cluster.
> >
> > In terms of the reader (Spark reads Druid), I'm most curious to find out what experience you are imagining. Should the reader read Druid segment files directly, or should it issue queries to Druid (maybe even to historicals?) so that the query can be parallelized?
> >
> > [Rajiv] Segments are going to be tricky, especially with things like compaction. I think we definitely need to be able to query the hot cache as well, so not just segments/historicals.
> >
> > Of the two, there is a lot more interest in the writer from the people I've been talking to.
> >
> > [Rajiv] We need both read and write for the different kinds of jobs.
> >
> > Responses to Julian's asks are in-line below.
> >
> > Regards,
> > Will
> >
> > On Tue, Aug 8, 2023 at 8:50 AM Julian Jaffe <julianfja...@gmail.com> wrote:
> > > Hey all,
> > >
> > > There was talk earlier this year about resurrecting the effort to add direct Spark readers and writers to Druid. Rather than repeat the previous attempt and parachute in with updated connectors, I'd like to start by building a little more consensus around what the Druid dev community wants as potential maintainers.
> > >
> > > To begin with, I want to solicit opinions on two topics:
> > >
> > > Should these connectors be written in Scala or Java? The benefits of Scala are that the existing connectors are written in Scala, as are most open-source references for Spark DataSource V2 implementations.
> > > The benefits of Java are that Druid is written in Java, so engineers interested in contributing to Druid wouldn't need to switch between languages. Additionally, existing tooling, static checkers, etc. could be used with minimal effort, keeping code style and developer ergonomics consistent across Druid instead of maintaining an alternate Scala toolchain in sync.
> >
> > [Rajiv] We need Java support.
> >
> > > Which Spark version should this effort target? The most recently released version of Spark is 3.4.1. Should we aim to integrate with the latest Spark minor version under the assumption that this will give us the longest window of support, or should we build against an older minor line (3.3? 3.2?) since most Spark users tend to lag? For reference, there are currently three stable Spark release versions: 3.2.4, 3.3.2, and 3.4.1. From a user's point of view, the API is mostly compatible across a major version (i.e. 3.x), while developer APIs such as the ones we would use to build these connectors can change between minor versions.
> > >
> > > There are quite a few nuances and trade-offs inherent to the decisions above, and my hope is that by hashing these choices out before presenting an implementation, we can build buy-in from the Druid maintainer community that will result in this effort succeeding where the first effort failed.
> >
> > [Rajiv] 3.4 (and above) will work for us.
> >
> > Thanks,
> > Rajiv
> >
> > > Thanks,
> > > Julian
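As a reference point for the Scala-vs-Java question quoted above: a DataSource V2 connector's entry point is small enough to sketch in Java. The `TableProvider` and `Table` interfaces below are Spark's actual 3.x connector API; the `Druid*` class names and the schema-lookup helper are hypothetical placeholders for illustration only.

```java
import java.util.Map;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Hypothetical sketch: the entry point Spark would discover for a "druid" format.
public class DruidTableProvider implements TableProvider {

  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    // A real connector would ask the Druid broker/coordinator for the
    // datasource's schema (e.g. via a segment metadata query).
    return DruidSchemaResolver.resolve(options.get("table")); // hypothetical helper
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning,
                        Map<String, String> properties) {
    // DruidTable (hypothetical) would implement SupportsRead and/or SupportsWrite
    // to hand Spark a ScanBuilder (reader) or WriteBuilder (writer).
    return new DruidTable(schema, properties);
  }
}
```

Whichever language is chosen, the surface area to implement is roughly this provider plus the read/write builder classes behind it, which keeps the Scala-vs-Java decision mostly a question of tooling and contributor ergonomics rather than code volume.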