Re: Spark Druid connectors, take 2

2023-08-08 Thread Will Xu
As for which version to target, I think we should survey the Druid community
and get input. In your case, which version are you currently deploying?
Historical experience tells me we should target current and current-1
(3.4.x and 3.3.x).

In terms of the writer (Spark writes to Druid), what's the user workflow you
envision? Would the user trigger a Spark job from Druid, or would the user
submit a Spark job that targets a Druid cluster? The former would allow other
systems, compaction for example, to use Spark as a runner.
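
To make the second workflow concrete, here is a minimal sketch of what a
user-submitted write might look like. This is only an illustration: the
"druid" format name and the option keys are hypothetical placeholders, not
an existing API.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DruidWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("druid-write-example")
        .getOrCreate();

    Dataset<Row> events = spark.read().parquet("s3://bucket/events/");

    // Hypothetical connector invocation: the short name and option keys
    // below stand in for whatever the connector ends up exposing.
    events.write()
        .format("druid")                            // assumed short name
        .option("druid.datasource", "wikipedia")    // assumed option key
        .option("druid.segment.granularity", "DAY") // assumed option key
        .mode(SaveMode.Append)
        .save();

    spark.stop();
  }
}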

In terms of the reader (Spark reads from Druid), I'm most curious to find out
what experience you are imagining. Should the reader read Druid segment files
directly, or would it issue queries to Druid (maybe even to historicals?) so
that the query can be parallelized?
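
For what it's worth, the segment-file approach maps naturally onto Spark's
DataSource V2 read path, where each segment becomes one input partition and
reads parallelize across executors. A rough sketch, assuming one partition
per segment; only the Spark interfaces are real, and the Druid-side pieces
(DruidSegmentBatch, DruidSegmentPartition, DruidSegmentReaderFactory) are
hypothetical placeholders:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.Batch;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

import java.util.List;

// Sketch: one Spark input partition per Druid segment, so a full scan
// parallelizes across executors without routing through the brokers.
class DruidSegmentBatch implements Batch {
  private final List<String> segmentLoadSpecs; // e.g. deep-storage paths

  DruidSegmentBatch(List<String> segmentLoadSpecs) {
    this.segmentLoadSpecs = segmentLoadSpecs;
  }

  @Override
  public InputPartition[] planInputPartitions() {
    // One partition per segment; Spark schedules these independently.
    return segmentLoadSpecs.stream()
        .map(DruidSegmentPartition::new)
        .toArray(InputPartition[]::new);
  }

  @Override
  public PartitionReaderFactory createReaderFactory() {
    return new DruidSegmentReaderFactory();
  }
}

class DruidSegmentPartition implements InputPartition {
  final String loadSpec;

  DruidSegmentPartition(String loadSpec) {
    this.loadSpec = loadSpec;
  }
}

class DruidSegmentReaderFactory implements PartitionReaderFactory {
  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    // A real implementation would fetch the segment from deep storage and
    // iterate its rows; that Druid-side logic is out of scope for this sketch.
    String loadSpec = ((DruidSegmentPartition) partition).loadSpec;
    throw new UnsupportedOperationException(
        "hypothetical reader for " + loadSpec);
  }
}

The query-based alternative would instead plan partitions around Druid's own
query machinery (e.g. one partition per historical), trading raw scan
throughput for reuse of Druid's query stack.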

Of the two, there is a lot more interest in the writer from the people I've
been talking to.

Regards,
Will


On Tue, Aug 8, 2023 at 8:50 AM Julian Jaffe  wrote:

> Hey all,
>
> There was talk earlier this year about resurrecting the effort to add
> direct Spark readers and writers to Druid. Rather than repeat the previous
> attempt and parachute in with updated connectors, I’d like to start by
> building a little more consensus around what the Druid dev community wants
> as potential maintainers.
>
> To begin with, I want to solicit opinions on two topics:
>
> 1. Should these connectors be written in Scala or Java? The benefits of
> Scala would be that the existing connectors are written in Scala, as are
> most open source references for Spark DataSource V2 implementations. The
> benefits of Java are that Druid is written in Java, and so engineers
> interested in contributing to Druid wouldn’t need to switch between
> languages. Additionally, existing tooling, static checkers, etc. could be
> used with minimal effort, conforming code style and developer ergonomics
> across Druid instead of needing to keep an alternate Scala toolchain in
> sync.
>
> 2. Which Spark version should this effort target? The most recently
> released version of Spark is 3.4.1. Should we aim to integrate with the
> latest Spark minor version under the assumption that this will give us the
> longest window of support, or should we build against an older minor line
> (3.3? 3.2?) since most Spark users tend to lag? For reference, there are
> currently 3 stable Spark release versions, 3.2.4, 3.3.2, and 3.4.1. From a
> user’s point of view, the API is mostly compatible across a major version
> (i.e. 3.x), while developer APIs such as the ones we would use to build
> these connectors can change between minor versions.
>
> There are quite a few nuances and trade-offs inherent to the decisions
> above, and my hope is that by hashing these choices out before presenting
> an implementation we can build buy-in from the Druid maintainer community
> that will result in this effort succeeding where the first effort failed.
>
> Thanks,
> Julian


Spark Druid connectors, take 2

2023-08-08 Thread Julian Jaffe
Hey all,

There was talk earlier this year about resurrecting the effort to add direct 
Spark readers and writers to Druid. Rather than repeat the previous attempt and 
parachute in with updated connectors, I’d like to start by building a little 
more consensus around what the Druid dev community wants as potential 
maintainers.

To begin with, I want to solicit opinions on two topics:

1. Should these connectors be written in Scala or Java? The benefits of Scala
would be that the existing connectors are written in Scala, as are most open
source references for Spark DataSource V2 implementations. The benefits of
Java are that Druid is written in Java, and so engineers interested in
contributing to Druid wouldn’t need to switch between languages. Additionally,
existing tooling, static checkers, etc. could be used with minimal effort,
conforming code style and developer ergonomics across Druid instead of needing
to keep an alternate Scala toolchain in sync.

2. Which Spark version should this effort target? The most recently released
version of Spark is 3.4.1. Should we aim to integrate with the latest Spark
minor version under the assumption that this will give us the longest window
of support, or should we build against an older minor line (3.3? 3.2?) since
most Spark users tend to lag? For reference, there are currently 3 stable
Spark release versions, 3.2.4, 3.3.2, and 3.4.1. From a user’s point of view,
the API is mostly compatible across a major version (i.e. 3.x), while
developer APIs such as the ones we would use to build these connectors can
change between minor versions.

There are quite a few nuances and trade-offs inherent to the decisions above,
and my hope is that by hashing these choices out before presenting an 
implementation we can build buy-in from the Druid maintainer community that 
will result in this effort succeeding where the first effort failed.

Thanks,
Julian

Re: [VOTE] Release Apache Druid 27.0.0 [RC1]

2023-08-08 Thread Karan Kumar
+1 (binding)

src package:

   - verified signature/checksum
   - built Druid (on an M1-based chipset)
   - ran a Druid cluster and tested:
      - all MSQ demo queries
      - a query from deep storage with results written out to S3
      - queries against segments not loaded on the historicals
      - queries against segments loaded on the historicals


binary package:

   - verified signature/checksum
   - LICENSE/NOTICE present
   - ran a Druid cluster and tested:
      - all MSQ demo queries
      - a query from deep storage with results written out to S3
      - queries against segments not loaded on the historicals
      - queries against segments loaded on the historicals



docker:

   - verified checksum



On Mon, Aug 7, 2023 at 11:23 AM Abhishek Agarwal  wrote:

> +1 (binding)
>
> src package:
> - verified signature/checksum
> - LICENSE/NOTICE present
> - built binary distribution:
>  - Loaded example Wikipedia dataset using MSQ and ran some queries
>  - Tested Kafka ingestion locally
>
> binary package:
> - verified signature/checksum
> - LICENSE/NOTICE present
> - built binary distribution:
>  - Loaded example Wikipedia dataset using MSQ and ran some queries
>  - Tested Kafka ingestion locally
>
> docker:
> - verified checksum
> - started cluster with docker-compose, loaded example Wikipedia dataset, and
> ran some queries
> - added the Kafka extension to the environment, started cluster with
> docker-compose, and then tested Kafka ingestion
>
> On Sun, Aug 6, 2023 at 1:43 PM Amatya Avadhanula  wrote:
>
> > Hi all,
> >
> > I have created a build for Apache Druid 27.0.0, release
> > candidate 1.
> >
> > Thanks to everyone who has helped contribute to the release! You can read
> > the proposed release notes here:
> > https://github.com/apache/druid/issues/14761
> >
> > The release candidate has been tagged in GitHub as druid-27.0.0-rc1,
> > available here:
> > https://github.com/apache/druid/tree/druid-27.0.0-rc1
> >
> > The artifacts to be voted on are located here:
> > https://dist.apache.org/repos/dist/dev/druid/27.0.0-rc1/
> >
> > A staged Maven repository is available for review at:
> > https://repository.apache.org/content/repositories/orgapachedruid-1044/
> >
> > Staged druid.apache.org website documentation is available here:
> > https://druid.staged.apache.org/docs/27.0.0/design/index.html
> >
> > A Docker image containing the binary of the release candidate can be
> > retrieved via:
> > docker pull apache/druid:27.0.0-rc1
> >
> > artifact checksums
> > src:
> > a3a755d02e2ed55a125ba562de4b4ce467d27af1132a89ecbc73cbb4a38622f7534813267ed44de1ec0fb85227e93d04c1c1ce24959d6d5c41dedbf2d7c6e4ed
> > bin:
> > b840ed0d77b1e5c11e058b161a56a469b0916febc1d478bf7ad2517cf79d2b724ece5cef21d96581122ba0958a5737f41e76496db3b10db5bfb4ab3123e4091b
> > docker: ca3df175bc944033c7c56ccf9499c05e2090ae6cefbdcd90095cfce2b7931ead
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/amatya.asc
> > [To be available within 48 hours: https://issues.apache.org/jira/browse/INFRA-24865]
> >
> > This key and the keys of other committers can also be found in the
> > project's KEYS file here:
> > https://dist.apache.org/repos/dist/release/druid/KEYS
> >
> > (If you are a committer, please feel free to add your own key to that
> > file by following the instructions in the file's header.)
> >
> >
> > Verify checksums:
> > diff <(shasum -a512 apache-druid-27.0.0-src.tar.gz | \
> > cut -d ' ' -f1) \
> > <(cat apache-druid-27.0.0-src.tar.gz.sha512 ; echo)
> >
> > diff <(shasum -a512 apache-druid-27.0.0-bin.tar.gz | \
> > cut -d ' ' -f1) \
> > <(cat apache-druid-27.0.0-bin.tar.gz.sha512 ; echo)
> >
> > Verify signatures:
> > gpg --verify apache-druid-27.0.0-src.tar.gz.asc \
> > apache-druid-27.0.0-src.tar.gz
> >
> > gpg --verify apache-druid-27.0.0-bin.tar.gz.asc \
> > apache-druid-27.0.0-bin.tar.gz
> >
> > Please review the proposed artifacts and vote. Note that Apache has
> > specific requirements that must be met before +1 binding votes can be
> > cast by PMC members. Please refer to the policy at
> > http://www.apache.org/legal/release-policy.html#policy for more details.
> >
> > As part of the validation process, the release artifacts can be generated
> > from source by running:
> > mvn clean install -Papache-release,dist -Dgpg.skip
> >
> > The RAT license check can be run from source by:
> > mvn apache-rat:check -Prat
> >
> > This vote will be open for at least 72 hours. The vote will pass if a
> > majority of at least three +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Druid 27.0.0
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> > [ ] -1 Do not release this package because...
> >
> > Thank you!
> >
>