Re: Spark Druid connectors, take 2

2023-10-26 Thread Maytas Monsereenusorn
I am still in favor of getting the Spark Druid connector working and merged
to master.
I am also still planning on taking this up but haven’t had time yet. If anyone 
else is interested and wants to pick this up, that would be great too!

Thanks,
Maytas
Sent from my iPhone

Re: Spark Druid connectors, take 2

2023-10-25 Thread Will Xu
Hi,
I want to revive this thread a bit. How is everyone feeling these days about
getting the Spark connector working? I want to see if we can coordinate a
proposal together.
Regards,
Will

Re: Spark Druid connectors, take 2

2023-08-09 Thread Maytas Monsereenusorn
Hi all,

First of all, thank you Julian for bringing this up and starting the
conversation.
Just to chime in on our (Netflix) use cases.
We use Spark 3.3 and would benefit from both reader and writer.
For the writer, we currently have a Spark job that writes its output to an
intermediate Iceberg table. We then separately issue a Druid batch ingestion
to consume from this intermediate Iceberg table (by passing the table's S3
paths). Write support to Druid from within the Spark job would let us
eliminate the intermediate Iceberg table and simplify our workflow (possibly
also reducing our storage and compute cost). To answer your question, this is
more aligned with a Spark job targeting a Druid cluster.
For the reader, we would like to be able to export data from Druid (such as
moving Druid data into an Iceberg table) and also to join or further process
Druid data with other, non-Druid data (such as other Iceberg tables) within
Spark jobs. To answer your question, this is more aligned with a Spark job
reading Druid segment files directly.
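Illustratively, the end state we're after might look like the sketch below.
The "druid" format name and its options are hypothetical (no such connector
exists yet), and an active SparkSession named spark is assumed:

    // Hypothetical end state, not an existing API: write directly to Druid,
    // skipping the intermediate Iceberg table, then read Druid data back.
    val events = spark.table("prod.events_cleaned")

    // Writer: would replace the Iceberg staging table plus the separate
    // batch ingestion step.
    events.write
      .format("druid")                 // assumed connector short name
      .option("datasource", "events")  // target Druid datasource
      .option("timeColumn", "__time")  // column to use as Druid's timestamp
      .mode("overwrite")
      .save()

    // Reader: export Druid data, or join it with non-Druid (e.g. Iceberg)
    // tables inside the same Spark job.
    val druidEvents = spark.read
      .format("druid")
      .option("datasource", "events")
      .load()
    druidEvents.join(spark.table("prod.users"), "user_id")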

Thanks,
Maytas



Re: Spark Druid connectors, take 2

2023-08-09 Thread Rajiv Mordani
Looks like the "blue" fonts didn't go through to the mailing list. However,
the responses are still tagged with [Rajiv].


- Rajiv



Re: Spark Druid connectors, take 2

2023-08-09 Thread Rajiv Mordani
Will, Julian,
See responses below tagged with [Rajiv] in blue:

From: Will Xu 
Date: Tuesday, August 8, 2023 at 9:27 AM
To: dev@druid.apache.org 
Subject: Re: Spark Druid connectors, take 2

As for which version to target, I think we should survey the Druid community
and get input. In your case, which version are you currently deploying?
Historical experience tells me we should target current and current-1
(3.4.x and 3.3.x).


[Rajiv] Version should be fine at least for our use cases.


In terms of the writer (Spark writes to Druid), what's the user workflow
you envision? Do you envision the user triggering a Spark job from Druid,
or a user submitting a Spark job that targets a Druid cluster? The former
would allow other systems (compaction, for example) to use Spark as a
runner.


[Rajiv] For us it is the latter, where a Spark job targets a Druid cluster.


In terms of the reader (Spark reads Druid), I'm most curious to find out
what experience you are imagining. Should the reader read Druid segment
files directly, or issue queries to Druid (maybe even to historicals?) so
that the query can be parallelized?


[Rajiv] Segments are going to be tricky, especially with things like
compaction. I think we definitely need to be able to query the hot cache as
well, so not just segments / historicals.


Of the two, there is a lot more interest in the writer from the people I've
been talking to.


[Rajiv] We need both read and write for the different kinds of jobs.

Responses to Julian’s asks in-line below:

Regards,
Will


On Tue, Aug 8, 2023 at 8:50 AM Julian Jaffe  wrote:

> Hey all,
>
> There was talk earlier this year about resurrecting the effort to add
> direct Spark readers and writers to Druid. Rather than repeat the previous
> attempt and parachute in with updated connectors, I’d like to start by
> building a little more consensus around what the Druid dev community wants
> as potential maintainers.
>
> To begin with, I want to solicit opinions on two topics:
>
> Should these connectors be written in Scala or Java? The benefits of Scala
> would be that the existing connectors are written in Scala, as are most
> open source references for Spark Datasource V2 implementations. The
> benefits of Java are that Druid is written in Java, and so engineers
> interested in contributing to Druid wouldn’t need to switch between
> languages. Additionally, existing tooling, static checkers, etc. could be
> used with minimal effort, conforming code style and developer ergonomics
> across Druid instead of needing to keep an alternate Scala tool chain in
> sync.

[Rajiv] We need Java support.


> Which Spark version should this effort target? The most recently released
> version of Spark is 3.4.1. Should we aim to integrate with the latest Spark
> minor version under the assumption that this will give us the longest
> window of support, or should we build against an older minor line (3.3?
> 3.2?) since most Spark users tend to lag? For reference, there are
> currently 3 stable Spark release versions, 3.2.4, 3.3.2, and 3.4.1. From a
> user’s point of view, the API is mostly compatible across a major version
> (i.e. 3.x), while developer APIs such as the ones we would use to build
> these connectors can change between minor versions.
> There are quite a few nuances and trade offs inherent to the decisions
> above, and my hope is that by hashing these choices out before presenting
> an implementation we can build buy-in from the Druid maintainer community
> that will result in this effort succeeding where the first effort failed.

[Rajiv] 3.4 (and above) will work for us.

Thanks


- Rajiv



>
> Thanks,
> Julian



Re: Spark Druid connectors, take 2

2023-08-09 Thread Will Xu
Yes, it does make sense.
For #2 (Spark reads Druid), I think Spark also needs to be able to get the
schema from Druid. This would probably be a query to the broker.
I wonder what the UX looks like for Spark SQL users and how they would
specify the schema. Would they create an EXTERNAL TABLE in Spark that maps
to a Druid datasource? Or would that be something users specify as a table
property? (I think those are good things to cover in the design proposal.)
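As a strawman (the provider and option names below are invented for
illustration; nothing like this exists yet), the external-table flavor might
look like:

    // Strawman UX, not an existing API: map a Druid datasource to a Spark
    // SQL table. Schema inference would be a single broker call -- Druid
    // already exposes datasource schemas via INFORMATION_SCHEMA.COLUMNS.
    spark.sql("""
      CREATE TABLE druid_wikipedia
      USING druid
      OPTIONS (datasource 'wikipedia', broker 'http://druid-broker:8082')
    """)
    spark.sql(
      "SELECT channel, COUNT(*) FROM druid_wikipedia GROUP BY channel").show()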

Regards,
Will



Re: Spark Druid connectors, take 2

2023-08-09 Thread Itai Yaffe
For proper disclosure, it's been a while since I used Druid, but here are my
2 cents w.r.t Will's question (based on what I originally wrote in this
design doc:
https://docs.google.com/document/d/112VsrCKhtqtUTph5yXMzsaoxtz9wX1U2poi1vxuDswY/edit#
):

   1. *Spark writes to Druid*:
      1. Based on what I've seen, the latter would be the more common
         choice, i.e. *I would assume most users would execute an external
         Spark job* (external to Druid, that is), e.g. from
         Databricks/EMR/... That job would process data and write the
         output into Druid (in the form of Druid segments written directly
         to Druid's deep storage, plus the required updates to Druid's
         metadata).
      2. If the community chooses to go down that route, I think it's also
         possible to execute other operations (e.g. compaction) from
         external Spark jobs, since they are sort of ingestion jobs (if you
         think about it at a high level), as they read Druid segments and
         write new Druid segments.
   2. *Spark reads from Druid*:
      1. You can already issue queries to Druid from Spark using JDBC (see
         the sketch after this list), so in this case the more appealing
         option, I think, is *to be able to read segment files directly*
         (especially for extremely heavy queries).
      2. In addition, you'd need to implement the ability for Spark to read
         segment files directly in order to support Druid->Druid ingestion
         (i.e. where your input is another Druid datasource), as well as to
         support compaction tasks (IIRC).
   3. Generally speaking, I agree with your observation w.r.t the bigger
      interest in the writer. The only comment here is that some additional
      benefits of the writer (e.g. Druid->Druid ingestion, support for
      compaction tasks) depend on implementing the reader.
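As a concrete reference for point 2.1, this is roughly what the JDBC route
looks like today, using Druid's Avatica endpoint and Spark's built-in JDBC
source (the broker host/port are illustrative; the Avatica client jar,
org.apache.calcite.avatica:avatica-core, must be on the Spark classpath):

    // Query Druid through the broker over JDBC. Everything flows through a
    // single connection, which is why direct segment reads are more
    // appealing for extremely heavy queries.
    val df = spark.read
      .format("jdbc")
      .option("url",
        "jdbc:avatica:remote:url=http://druid-broker:8082/druid/v2/sql/avatica/")
      .option("driver", "org.apache.calcite.avatica.remote.Driver")
      .option("query", "SELECT __time, channel, added FROM wikipedia")
      .load()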

Hope that helps 

Thanks!



Re: Spark Druid connectors, take 2

2023-08-08 Thread Will Xu
As for which version to target, I think we should survey the Druid community
and get input. In your case, which version are you currently deploying?
Historical experience tells me we should target current and current-1
(3.4.x and 3.3.x).

In terms of the writer (Spark writes to Druid), what's the user workflow
you envision? Do you envision the user triggering a Spark job from Druid,
or a user submitting a Spark job that targets a Druid cluster? The former
would allow other systems (compaction, for example) to use Spark as a
runner.

In terms of the reader (Spark reads Druid), I'm most curious to find out
what experience you are imagining. Should the reader read Druid segment
files directly, or issue queries to Druid (maybe even to historicals?) so
that the query can be parallelized?

Of the two, there is a lot more interest in the writer from the people I've
been talking to.

Regards,
Will




Spark Druid connectors, take 2

2023-08-08 Thread Julian Jaffe
Hey all,

There was talk earlier this year about resurrecting the effort to add direct 
Spark readers and writers to Druid. Rather than repeat the previous attempt and 
parachute in with updated connectors, I’d like to start by building a little 
more consensus around what the Druid dev community wants as potential 
maintainers.

To begin with, I want to solicit opinions on two topics:

1. Should these connectors be written in Scala or Java? The benefits of Scala
would be that the existing connectors are written in Scala, as are most open
source references for Spark Datasource V2 implementations. The benefits of
Java are that Druid is written in Java, and so engineers interested in
contributing to Druid wouldn’t need to switch between languages.
Additionally, existing tooling, static checkers, etc. could be used with
minimal effort, conforming code style and developer ergonomics across Druid
instead of needing to keep an alternate Scala tool chain in sync.

2. Which Spark version should this effort target? The most recently released
version of Spark is 3.4.1. Should we aim to integrate with the latest Spark
minor version under the assumption that this will give us the longest window
of support, or should we build against an older minor line (3.3? 3.2?) since
most Spark users tend to lag? For reference, there are currently 3 stable
Spark release versions: 3.2.4, 3.3.2, and 3.4.1. From a user’s point of view,
the API is mostly compatible across a major version (i.e. 3.x), while
developer APIs such as the ones we would use to build these connectors can
change between minor versions.
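To make the developer-API point in topic 2 concrete, here is a minimal
sketch of the read side of a Datasource V2 implementation against the Spark
3.x connector interfaces. All Druid-specific names are invented for
illustration, and the scan builder is left unimplemented:

    import java.util
    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read.ScanBuilder
    import org.apache.spark.sql.types.{StringType, StructType, TimestampType}
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Entry point Spark would discover for format("druid") (name hypothetical).
    class DruidTableProvider extends TableProvider {
      // Called when the user supplies no schema; a real implementation
      // would fetch the datasource's schema from Druid here.
      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        new StructType().add("__time", TimestampType).add("dim1", StringType)

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table =
        new DruidTable(schema)
    }

    class DruidTable(schema0: StructType) extends Table with SupportsRead {
      override def name(): String = "druid"
      override def schema(): StructType = schema0
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_READ)
      // A real scan would plan one input partition per Druid segment (or
      // per query slice) so reads parallelize across executors.
      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
        throw new UnsupportedOperationException("sketch only")
    }

These interfaces (TableProvider, SupportsRead, ScanBuilder, and friends) are
exactly the kind of developer API that can change between minor versions,
which is what makes the version choice consequential.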
There are quite a few nuances and trade offs inherent to the decisions above, 
and my hope is that by hashing these choices out before presenting an 
implementation we can build buy-in from the Druid maintainer community that 
will result in this effort succeeding where the first effort failed.

Thanks,
Julian

Re: Spark-Druid Connectors

2021-06-27 Thread Julian Jaffe
Bimonthly ping for reviews :) I’m perfectly willing to hop on Slack or a video 
call to walk through the code and design as well if potential reviewers would 
find that helpful.



Re: Spark-Druid Connectors

2021-04-14 Thread Julian Jaffe
Hey Samarth,

I’m overjoyed to hear that! The PR is here: 
https://github.com/apache/druid/pull/10920. I’ll add you as a reviewer as well 
when I have a moment.

Thanks,
Julian



Re: Spark-Druid Connectors

2021-04-14 Thread Samarth Jain
Hi Julian,

I would be happy to review your Spark-Druid connector PRs. Ingesting data
into Druid using the Spark SQL and DataFrame APIs is something we are very
keen to adopt.
Could you point me to them or alternatively add me as a reviewer?

- Samarth



Re: Spark-Druid Connectors

2021-04-14 Thread Julian Jaffe
Hey Gian and other Druids,

Is there anything I can do to encourage reviews of this code? Would a dev guide 
or design doc be helpful to reviewers? Can I bribe someone with chocolate :)?

Thanks,
Julian




Re: Spark-Druid Connectors

2021-03-02 Thread Gian Merlino
Thank you!



Re: Spark-Druid Connectors

2021-02-25 Thread Julian Jaffe
Hey Gian,

I’d be overjoyed to be proven wrong! For what it’s worth, my pessimism was not 
driven by a lack of faith in the Druid community or the Druid committers but by 
the fact that these connectors may be an awkward fit in the Druid code base 
without more buy-in from the community writ large.

The information you’re asking for is spread across a few places. I’ll 
consolidate it into the PR, emphasizing the UX and the tests. I should have it 
up within a day or so.

Thanks,
Julian



Re: Spark-Druid Connectors

2021-02-23 Thread Gian Merlino
Hey Julian,

Your pessimism in this matter is understandable but regrettable!

It would be great to see this effort become part of mainline Druid. It is a
more maintainable approach than a separate repo, because it gets rid of the
risk of interface drift, and it makes sure that all the tests are run
whenever we do a Druid release. It's more upfront work for you (and for
us), but Spark and Druid are both important OSS projects and I think it is
good to encourage better integration between them. I have also written in
the past about the importance of us getting better at accepting
contributions (at https://s.apache.org/aqicd). It is not always easy, since
reviewing contributions takes time, and it is mostly done on a volunteer
basis. But I think if you are game to work with us on this one, let's try
to get it in. I say that out of pure idealism, not having looked at the
design or code at all 

In the mail I linked, I had written:

> For contributors, focusing on UX and tests means writing out (in natural
> language) how your patch changes user experience, and why you think this
> change is a good idea. It also means having good testing of the new stuff
> you're adding, and writing out (in natural language) why you think your
> tests cover all the important cases. Speaking as a person that has
> reviewed
> a lot of code: these natural language descriptions are *very helpful*,
> especially when they add context to the patch. Don't make reviewers
> reverse-engineer your code to guess what you were thinking.

As I said, I haven't looked at your design doc or PR yet. But if they cover
the above stuff, could you please point me to the right places that have
the most up-to-date info, and I will put my money where my mouth is and
review them in the way that I suggested in that thread. (i.e., focusing on
user experience and test coverage.)

By the way, I think the mailing list chomped your links. I'll reproduce
them here.

1) Mailing list:
https://lists.apache.org/thread.html/r8219a7be0583ae3d9a2303fa7f21872782cf0703812a410bb62acfef%40%3Cdev.druid.apache.org%3E
2) Slack: https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600
3) GitHub: https://github.com/apache/druid/issues/9780
4) Pull request: https://github.com/apache/druid/pull/10920



Spark-Druid Connectors

2021-02-23 Thread Julian Jaffe

Hey Druids,

Last April, there was some discussion on this mailing list, Slack, and GitHub 
around building Spark-Druid connectors. After working up a rough cut, the 
effort was dormant until a few weeks ago when I returned to it. I’ve opened a 
pull request for the connectors, but I don’t realistically expect it to be 
accepted. Am I too pessimistic in my assumptions here? Otherwise, what’s the 
best course of action - create a standalone repo and add a link in the Druid 
docs?

Julian


Spark-Druid Connectors Proposal

2020-04-28 Thread Julian Jaffe
Hey all,

There have been ongoing discussions on this list and in Slack about
improving interoperability between Spark and Druid by creating Spark
connectors that can read from and write to Druid clusters. As these
discussions have begun to converge on a potential solution, I've opened a
proposal laying out how we
can implement this functionality.

Thanks,
Julian