I am still in favor of getting the Spark Druid connector working and merged 
to master.
I am also still planning on taking this up but haven’t had time yet. If anyone 
else is interested and wants to pick this up, that would be great too!

Thanks,
Maytas

> On Oct 25, 2023, at 6:25 AM, Will Xu <2beth...@gmail.com> wrote:
> 
> Hi,
> I want to revive this thread a bit. How are people feeling these days
> about getting the Spark connectors working? I want to see if we can
> coordinate a proposal together.
> Regards,
> Will
> 
>> On Thu, Aug 10, 2023 at 3:05 AM Maytas Monsereenusorn <mayt...@gmail.com>
>> wrote:
>> 
>> Hi all,
>> 
>> First of all, thank you Julian for bringing this up and starting the
>> conversation.
>> Just to chime in on our (Netflix) use cases.
>> We use Spark 3.3 and would benefit from both reader and writer.
>> For the writer, we currently have a Spark job that writes data to an
>> intermediate Iceberg table. We then separately issue a Druid batch
>> ingestion to consume from this intermediate Iceberg table (by passing the
>> S3 paths of the table). Having write support from within the Spark job
>> (directly to Druid) would let us eliminate this intermediate Iceberg
>> table and simplify our workflow (possibly also reducing our storage and
>> compute cost). To answer your question, I think this is more aligned with
>> having a Spark job target a Druid cluster.
>> For the reader, we would like to be able to export data from Druid (such
>> as moving Druid data into an Iceberg table) and to join or further
>> process Druid data with other (non-Druid) data (such as other Iceberg
>> tables) within Spark jobs. To answer your question, I think this is more
>> aligned with the reader in a Spark job reading Druid segment files
>> directly.
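>> 
>> To make the two use cases concrete, here is a rough sketch of what the
>> user-facing API could look like for us. This is only an illustration:
>> the "druid" format name, the option names, and the table names below are
>> all hypothetical, not an existing API.
>> 
>>   // Hypothetical spark-shell usage; format/option names are invented.
>>   // Writer: push a DataFrame straight into a Druid datasource,
>>   // skipping the intermediate Iceberg table.
>>   val events = spark.table("events")
>>   events.write
>>     .format("druid")                                // hypothetical name
>>     .option("broker", "https://druid-broker:8282")  // hypothetical option
>>     .option("datasource", "events")
>>     .mode("append")
>>     .save()
>> 
>>   // Reader: load a Druid datasource as a DataFrame and join it with an
>>   // Iceberg table in the same Spark job.
>>   val druidEvents = spark.read
>>     .format("druid")
>>     .option("datasource", "events")
>>     .load()
>>   druidEvents.join(spark.table("iceberg_db.dim_users"), "user_id")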
>> 
>> Thanks,
>> Maytas
>> 
>> 
>> 
>> On Wed, Aug 9, 2023 at 2:14 PM Rajiv Mordani <rmord...@vmware.com.invalid>
>> wrote:
>> 
>>> Will, Julian,
>>> See responses below tagged with [Rajiv]:
>>> 
>>> From: Will Xu <will...@imply.io.INVALID>
>>> Date: Tuesday, August 8, 2023 at 9:27 AM
>>> To: dev@druid.apache.org <dev@druid.apache.org>
>>> Subject: Re: Spark Druid connectors, take 2
>>> 
>>> As for which version to target, I think we should survey the Druid
>>> community and get input. In your case, which version are you currently
>>> deploying? Historical experience tells me we should target current and
>>> current-1 (3.4.x and 3.3.x).
>>> 
>>> 
>>> [Rajiv] Either of those versions should be fine, at least for our use
>>> cases.
>>> 
>>> 
>>> In terms of the writer (Spark writes to Druid), what's the user workflow
>>> you envision? Would the user trigger a Spark job from Druid, or would
>>> the user submit a Spark job that targets a Druid cluster? The former
>>> would allow other systems (compaction, for example) to use Spark as a
>>> runner.
>>> 
>>> 
>>> [Rajiv] For us it is the latter, where a Spark job targets a Druid
>>> cluster.
>>> 
>>> 
>>> In terms of the reader (Spark reads Druid), I'm most curious to find out
>>> what experience you are imagining. Should the reader read Druid segment
>>> files directly, or should it issue queries to Druid (maybe even to
>>> historicals?) so that queries can be parallelized?
>>> 
>>> 
>>> [Rajiv] Segments are going to be tricky, especially with things like
>>> compaction. I think we definitely need to be able to query the hot cache
>>> as well, so not just segments/historicals.
>>> 
>>> 
>>> Of the two, there is a lot more interest in the writer from the people
>>> I've been talking to.
>>> 
>>> 
>>> [Rajiv] We need both read and write for the different kinds of jobs.
>>> 
>>> Responses to Julian’s questions are in-line below:
>>> 
>>> Regards,
>>> Will
>>> 
>>> 
>>> On Tue, Aug 8, 2023 at 8:50 AM Julian Jaffe <julianfja...@gmail.com>
>>> wrote:
>>> 
>>>> Hey all,
>>>> 
>>>> There was talk earlier this year about resurrecting the effort to add
>>>> direct Spark readers and writers to Druid. Rather than repeat the
>>>> previous attempt and parachute in with updated connectors, I’d like to
>>>> start by building a little more consensus around what the Druid dev
>>>> community wants as potential maintainers.
>>>> 
>>>> To begin with, I want to solicit opinions on two topics:
>>>> 
>>>> Should these connectors be written in Scala or Java? The benefits of
>>>> Scala would be that the existing connectors are written in Scala, as
>>>> are most open source references for Spark Datasource V2
>>>> implementations. The benefits of Java are that Druid is written in
>>>> Java, so engineers interested in contributing to Druid wouldn’t need to
>>>> switch between languages. Additionally, existing tooling, static
>>>> checkers, etc. could be used with minimal effort, keeping code style
>>>> and developer ergonomics consistent across Druid instead of needing to
>>>> keep an alternate Scala tool chain in sync.
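>>>> 
>>>> For concreteness, here is a minimal sketch of the kind of Datasource V2
>>>> entry point we would be implementing (in Scala here, though the Java
>>>> version would be nearly identical, since the Datasource V2 interfaces
>>>> are plain Java interfaces). The class name and stubbed behavior are
>>>> purely illustrative:
>>>> 
>>>>   // Hypothetical skeleton only; name and behavior are invented.
>>>>   import java.util.{Map => JMap}
>>>>   import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
>>>>   import org.apache.spark.sql.connector.expressions.Transform
>>>>   import org.apache.spark.sql.types.StructType
>>>>   import org.apache.spark.sql.util.CaseInsensitiveStringMap
>>>> 
>>>>   class DruidTableProvider extends TableProvider {
>>>>     // e.g. resolve the datasource schema from the Druid broker
>>>>     override def inferSchema(
>>>>         options: CaseInsensitiveStringMap): StructType =
>>>>       throw new UnsupportedOperationException(
>>>>         "stub: fetch schema from Druid")
>>>> 
>>>>     // Return a Table whose ScanBuilder/WriteBuilder do the real work
>>>>     override def getTable(
>>>>         schema: StructType,
>>>>         partitioning: Array[Transform],
>>>>         properties: JMap[String, String]): Table =
>>>>       throw new UnsupportedOperationException(
>>>>         "stub: build a Druid-backed Table")
>>>>   }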
>>> 
>>> [Rajiv] We need Java support.
>>> 
>>> 
>>>> Which Spark version should this effort target? The most recently
>>>> released version of Spark is 3.4.1. Should we aim to integrate with the
>>>> latest Spark minor version under the assumption that this will give us
>>>> the longest window of support, or should we build against an older
>>>> minor line (3.3? 3.2?) since most Spark users tend to lag? For
>>>> reference, there are currently 3 stable Spark release versions, 3.2.4,
>>>> 3.3.2, and 3.4.1. From a user’s point of view, the API is mostly
>>>> compatible across a major version (i.e. 3.x), while developer APIs such
>>>> as the ones we would use to build these connectors can change between
>>>> minor versions.
>>>> 
>>>> There are quite a few nuances and trade-offs inherent to the decisions
>>>> above, and my hope is that by hashing these choices out before
>>>> presenting an implementation we can build buy-in from the Druid
>>>> maintainer community that will result in this effort succeeding where
>>>> the first effort failed.
>>> 
>>> [Rajiv] 3.4 (and above) will work for us.
>>> 
>>> Thanks,
>>> 
>>> Rajiv
>>> 
>>> 
>>> 
>>>> 
>>>> Thanks,
>>>> Julian
>>> 
>>> 
>> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org
