Re: Spark Druid connectors, take 2

2023-08-09 Thread Itai Yaffe
In the interest of full disclosure, it's been a while since I used Druid, but
here are my 2 cents w.r.t. Will's question (based on what I originally wrote
in the design doc):

   1. *Spark writes to Druid*:
  1. Based on what I've seen, the latter would be the more common
  choice, i.e. *I would assume most users would execute an external
  Spark job* (external to Druid, that is), e.g. from Databricks/EMR/...
  That job would process data and write the output into Druid (in the
  form of Druid segments written directly to Druid's deep storage, plus the
  required updates to Druid's metadata).
  2. If the community chooses to go down that route, I think it's also
  possible to execute other operations (e.g. compaction) from external Spark
  jobs, since they are sort of ingestion jobs (if you think about it at a
  high level), as they read Druid segments and write new Druid segments.
   2. *Spark reads from Druid*:
  1. You can already issue queries to Druid from Spark using JDBC (see the
  sketch right after this list), so in this case, the more appealing option,
  I think, is *to be able to read segment files directly* (especially for
  extremely heavy queries).
  2. In addition, you'd need to implement the ability for Spark to read
  segment files directly in order to support Druid->Druid ingestion (i.e.
  where your input is another Druid datasource), as well as to support
  compaction tasks (IIRC).
   3. Generally speaking, I agree with your observation w.r.t. the bigger
    interest in the writer.
    The only comment here is that some additional benefits of the writer
    (e.g. Druid->Druid ingestion, support for compaction tasks) depend on
    implementing the reader.
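
Regarding 2.1, here is a minimal sketch of issuing a query to Druid from
Spark over JDBC as it works today (the broker host/port, datasource, and
column names are placeholders, and the Avatica JDBC driver has to be on the
Spark classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("druid-jdbc-read").getOrCreate()

    // This goes through the Broker like any other Druid SQL query, which is
    // exactly why reading segment files directly is more appealing for
    // extremely heavy, batch-oriented workloads.
    val clicksPerCountry = spark.read
      .format("jdbc")
      .option("url",
        "jdbc:avatica:remote:url=http://broker-host:8082/druid/v2/sql/avatica/")
      .option("driver", "org.apache.calcite.avatica.remote.Driver")
      .option("query",
        "SELECT country, SUM(clicks) AS clicks FROM events GROUP BY country")
      .load()

    clicksPerCountry.show()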

Hope that helps 🙂

Thanks!

On Tue, 8 Aug 2023 at 19:27, Will Xu  wrote:

> For which version to target, I think we should survey the Druid community
> and get input. In your case, which version are you currently deploying?
> Historical experience tells me we should target current and current-1
> (3.4.x and 3.3.x)
>
> In terms of the writer (Spark writes to Druid), what's the user workflow
> you envision? Do you think the user would trigger a Spark job from
> Druid? Or is it the user who submits a Spark job targeting a Druid
> cluster? The former would allow other systems, compaction for example, to
> use Spark as a runner.
>
> In terms of the reader (Spark reads from Druid), I'm most curious to find out
> what experience you are imagining. Should the reader be reading Druid
> segment files or would the reader issue queries to Druid (maybe even to
> historicals?) so that query can be parallelized?
>
> Of the two, there is a lot more interest in the writer from the people I've
> been talking to.
>
> Regards,
> Will
>
>
> On Tue, Aug 8, 2023 at 8:50 AM Julian Jaffe 
> wrote:
>
> > Hey all,
> >
> > There was talk earlier this year about resurrecting the effort to add
> > direct Spark readers and writers to Druid. Rather than repeat the
> previous
> > attempt and parachute in with updated connectors, I’d like to start by
> > building a little more consensus around what the Druid dev community
> wants
> > as potential maintainers.
> >
> > To begin with, I want to solicit opinions on two topics:
> >
> > Should these connectors be written in Scala or Java? The benefits of
> Scala
> > would be that the existing connectors are written in Scala, as are most
> > open source references for Spark Datasource V2 implementations. The
> > benefits of Java are that Druid is written in Java, and so engineers
> > interested in contributing to Druid wouldn’t need to switch between
> > languages. Additionally, existing tooling, static checkers, etc. could be
> > used with minimal effort, conforming code style and developer ergonomics
> > across Druid instead of needing to keep an alternate Scala tool chain in
> > sync.
> >
> > Which Spark version should this effort target? The most recently released
> > version of Spark is 3.4.1. Should we aim to integrate with the latest
> Spark
> > minor version under the assumption that this will give us the longest
> > window of support, or should we build against an older minor line (3.3?
> > 3.2?) since most Spark users tend to lag? For reference, there are
> > currently 3 stable Spark release versions, 3.2.4, 3.3.2, and 3.4.1. From
> a
> > user’s point of view, the API is mostly compatible across a major version
> > (i.e. 3.x), while developer APIs such as the ones we would use to build
> > these connectors can change between minor versions.
> >
> > There are quite a few nuances and trade-offs inherent to the decisions
> > above, and my hope is that by hashing these choices out before presenting
> > an implementation we can build buy-in from the Druid maintainer community
> > that will result in this effort succeeding where the first effort failed.
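
For reference, the Datasource V2 entry points in question look roughly like
the skeleton below (Spark 3.x, written in Scala purely for illustration; the
class names and everything Druid-specific are placeholders, not an existing
implementation):

    import java.util

    import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
    import org.apache.spark.sql.sources.DataSourceRegister
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical entry point that .format("druid") would resolve to, once
    // registered via META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
    class DruidDataSource extends TableProvider with DataSourceRegister {
      override def shortName(): String = "druid"

      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        new StructType() // a real connector would derive this from Druid's metadata

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table =
        new DruidTable(schema)
    }

    class DruidTable(tableSchema: StructType) extends Table with SupportsWrite {
      override def name(): String = "druid"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_WRITE)

      // The WriteBuilder/BatchWrite/DataWriter chain is where segment building,
      // pushing to deep storage, and metadata publishing would live.
      override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
        throw new UnsupportedOperationException("sketch only")
    }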
> >
> > Thanks,
> > Julian
>


Re: [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Itai Yaffe
Hey Clint,
I think it's definitely a step in the right direction.
One thing I would suggest, since there are several deployments using Hadoop
(for deep storage and/or ingestion), is to let the wider
community know in advance that Hadoop 2.x support is going to be dropped in
favor of 3.x (so they have time to adjust their deployments accordingly).
If that sort of community-wide notification has already been done and I
missed it, please let me know.

Thanks!
  Itai

On Tue, Jun 8, 2021 at 11:08 AM Clint Wylie  wrote:

> Hi all,
>
> I've been assisting with some experiments to see how we might want to
> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
> can finally be free of some of the dependency issues it has been causing
> for as long as I can remember working with Druid.
>
> Hadoop 3 introduced shaded client jars,
> https://issues.apache.org/jira/browse/HADOOP-11804, whose purpose is to
> let applications talk to the Hadoop cluster without drowning in its
> transitive dependencies. The experimental branch that I have been helping
> with, which is using these new shaded client jars, can be seen in this PR
> https://github.com/apache/druid/pull/11314, and is currently working with
> the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
> docs (which is pretty much equivalent to the HDFS integration test).
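
For reference, the shaded client artifacts from HADOOP-11804 are published
under coordinates along these lines (shown as an sbt-style snippet purely for
illustration; the version would be whichever 3.x release gets targeted):

    // Hadoop 3 shaded clients: applications depend on these instead of pulling
    // in the full transitive dependency tree of hadoop-client.
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client-api"     % "3.3.0",
      "org.apache.hadoop" % "hadoop-client-runtime" % "3.3.0" % Runtime
    )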
>
> The cloud deep storages still need some further testing, and some minor
> cleanup still needs to be done for the docs and such. Additionally, we still
> need to figure out how to handle the Kerberos extension: because it extends
> some Hadoop classes, it isn't able to use the shaded client jars in a
> straightforward manner, and so it still has heavy dependencies and hasn't
> been tested. However, the experiment has started to pan out enough to where
> I think it is worth starting this discussion, because it does have some
> implications.
>
> Making this change I think will allow us to update our dependencies with a
> lot more freedom (I'm looking at you, Guava), but the catch is that once we
> make this change and start updating these dependencies, it will become
> hard, if not impossible, to support Hadoop 2.x, since as far as I know
> there isn't an equivalent set of shaded client jars. I am also not certain
> how far back the Hadoop job classpath isolation stuff goes
> (mapreduce.job.classloader = true) which I think is required to be set on
> Druid tasks for this shaded stuff to work alongside updated Druid
> dependencies.
>
> Is anyone opposed to or worried about dropping Hadoop 2.x support after the
> Druid 0.22 release?
>


Re: Propose a scheme for Coordinator to pull metadata incrementally

2021-04-06 Thread Itai Yaffe
Hey,
I'm not a Druid developer, so it's quite possible I'm missing many
considerations here, but at first glance, I like your proposal, as it
resembles the *tsColumn* in JDBC lookups (
https://druid.apache.org/docs/latest/development/extensions-core/lookups-cached-global.html#jdbc-lookup
).

Anyway, just my 2 cents.

Thanks!
  Itai

On Tue, Apr 6, 2021 at 6:07 AM Benedict Jin  wrote:

> Hi all,
>
> Recently, the Coordinator in our company's Druid cluster has hit a
> performance bottleneck when pulling metadata. The main reason is the huge
> amount of metadata, which makes the full-table scan of the metadata store
> and the deserialization of metadata very slow. The size of the full
> metadata has been reduced through TTL, compaction, rollup, etc., but
> the effect is not very significant. Therefore, I want to design a scheme
> for the Coordinator to pull metadata incrementally, that is, each time the
> Coordinator only pulls newly added metadata, so as to reduce the query
> pressure on the metadata store and the cost of deserializing metadata.
> The general idea is to add a column last_update to the druid_segments table
> to record the update time of each record. Furthermore, when we query the
> metadata table, we can add filter conditions on the last_update column to
> avoid full table scans. Moreover, both MySQL and PostgreSQL, as metadata
> storage backends, can support automatically updating such a timestamp
> column, which is somewhat similar to the behavior of triggers. So, have you
> encountered this problem before? If so, how did you solve it? In addition,
> do you have any suggestions or comments on the above incremental
> acquisition of metadata? Please let me know, thanks a lot.
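
A rough sketch of what such an incremental pull could look like against the
metadata store, assuming the proposed last_update column exists (the column,
the MySQL-flavored DDL in the comment, and the plain-JDBC style are purely
illustrative; the Coordinator's actual polling code is structured differently):

    import java.sql.{DriverManager, Timestamp}
    import scala.collection.mutable.ArrayBuffer

    // Assumed schema change (MySQL flavor), per the proposal above:
    //   ALTER TABLE druid_segments
    //     ADD COLUMN last_update TIMESTAMP NOT NULL
    //     DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
    //
    // Each poll then fetches only the rows whose last_update is newer than the
    // high-water mark remembered from the previous poll, instead of scanning
    // and deserializing the whole table every time.
    def pollChangedSegments(jdbcUrl: String, since: Timestamp): Seq[Array[Byte]] = {
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val stmt = conn.prepareStatement(
          "SELECT payload FROM druid_segments WHERE used = true AND last_update > ?")
        stmt.setTimestamp(1, since)
        val rs = stmt.executeQuery()
        val payloads = ArrayBuffer.empty[Array[Byte]]
        while (rs.next()) {
          payloads += rs.getBytes("payload") // serialized segment metadata
        }
        payloads.toSeq
      } finally {
        conn.close()
      }
    }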
>
> Regards,
> Benedict Jin
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
>
>


Re: Draft April ASF Board Report

2020-04-09 Thread itai yaffe
Clint - sorry for that, my mistake. I was under the (false) impression
that those were quarterly reports.
Thanks for the clarification, Gian!

On Thu, Apr 9, 2020 at 12:02 AM Clint Wylie  wrote:

> Hmm, I did the numbers as if it were a quarterly report this time, but I
> didn't include the stuff from the previous monthly reports and wasn't sure
> if it is necessary to go over it again. I think it is probably fine without
> it since the information was included in previous reports?
>
> On Wed, Apr 8, 2020 at 1:44 PM Gian Merlino  wrote:
>
> > It does matter! But, we mentioned those in a previous report (our last
> one
> > was just a month ago — so this one covers the last month). After this
> > report they'll start being quarterly and covering 3 months.
> >
> > On Wed, Apr 8, 2020 at 1:18 PM itai yaffe  wrote:
> >
> > > Hey,
> > > Not sure it matters, but we actually had at least 3 in-person meetups
> > > during Q1 organized by the Druid community:
> > >
> > >1. January 15th in London -
> > >https://www.meetup.com/Apache-Druid-London/events/267380924/
> > >2. January 28th in Athens -
> > >https://www.meetup.com/Athens-Big-Data/events/266900242/
> > >3. January 29th in Tel-Aviv (hosted by Nielsen) -
> > >
> > https://www.meetup.com/Big-things-are-happening-here/events/267578817/
> > >
> > > Sorry if I left out other events I'm not aware of...
> > >
> > > On Wed, Apr 8, 2020 at 4:05 AM Clint Wylie  wrote:
> > >
> > > > Hey all,
> > > >
> > > > I put together a draft for the quarterly ASF board report due
> tomorrow,
> > > > sorry for the short notice. Let me know if I missed anything or
> should
> > > make
> > > > any changes. Thanks!
> > > >
> > > > -
> > > >
> > > > ## Description
> > > >
> > > > Apache Druid is a high performance real-time analytics database. It
> is
> > > > designed for workflows where low-latency query and ingest are the
> main
> > > > requirements. It implements ingestion, storage, and querying
> > subsystems.
> > > > Users interface with Druid through built-in SQL and JSON APIs, as
> well
> > > > as third-party applications.
> > > >
> > > > Druid has an extensive web of connections with other Apache projects:
> > > > Calcite for SQL planning, Curator and ZooKeeper for coordination,
> Kafka
> > > > and Hadoop as data sources, Avro, ORC, or Parquet as supported data
> > input
> > > > formats, and DataSketches for scalable approximate algorithms. Druid
> > > > can also be used as a data source by Superset.
> > > >
> > > > ## Issues
> > > >
> > > > There are no issues requiring board attention at this time.
> > > >
> > > > ## Activity
> > > >
> > > > We are currently finishing up our 2nd post-graduation release,
> 0.18.0,
> > > > which we hope to have wrapped up and ready for release in the week of
> > > > April 13th. Additionally, we released 0.17.1 on April 1st, in
> response
> > > > to a vulnerability report received by the Apache Security Team. The
> > > > corresponding CVE is CVE-2020-1958 and details are available at
> > > > https://nvd.nist.gov/vuln/detail/CVE-2020-1958.
> > > >
> > > > To update on community happenings, since our last board report we
> > > > have had 1 virtual meetup which was a success, with an additional
> > > > virtual meetup scheduled for April 8th. Due to COVID-19, all
> in-person
> > > > meetups have been put on hold, in favor of virtual meetups. Likewise,
> > > > the Druid Summit event has been rescheduled for November 2-4, with a
> > > > smaller virtual event scheduled for April 15th.
> > > >
> > > > Mailing list activity is healthy with 156 emails on the dev list
> > > > (dev@druid.apache.org) over the last quarter. Our ASF slack channel,
> > > > #druid, has nearly 750 members, with daily activity of users asking
> for
> > > > and offering support to each other.
> > > >
> > > > ## Recent PMC changes
> > > >
> > > >  - Currently 27 PMC members.
> > > >  - No changes to PMC since graduation.
> > > >
> > > > ## Recent committer changes
> > > >
> > > >  - Currently 35 committers.
> > > >  - No recent changes to committers, the most recent addition was
> > > >Chi Cao Minh on 2020-01-15
> > > >
> > > > ## Recent releases
> > > >
> > > >  - 0.17.1, a security release, was released on April 1 2020
> > > >
> > > > ## Development activity by the numbers
> > > >
> > > > In the last quarter:
> > > >
> > > >  - 317 pull requests opened
> > > >  - 323 pull requests merged/closed
> > > >  - 178 issues opened
> > > >  - 112 issues closed
> > > >  - 878 comments on pull requests
> > > >  - 541 comments on issues
> > > >
> > >
> >
>


Re: Druid build supported with spark ?

2020-04-08 Thread itai yaffe
Hey,
Can you please explain what you mean by "supported to build with Spark"?

On Wed, Apr 8, 2020 at 8:43 AM 温利军  wrote:

>
>
> When druid   is it supported to build with spark
>
>
>
> 温利军
> M  18634732595
>
>


Re: Draft April ASF Board Report

2020-04-08 Thread itai yaffe
Hey,
Not sure it matters, but we actually had at least 3 in-person meetups
during Q1 organized by the Druid community:

   1. January 15th in London -
   https://www.meetup.com/Apache-Druid-London/events/267380924/
   2. January 28th in Athens -
   https://www.meetup.com/Athens-Big-Data/events/266900242/
   3. January 29th in Tel-Aviv (hosted by Nielsen) -
   https://www.meetup.com/Big-things-are-happening-here/events/267578817/

Sorry if I left out other events I'm not aware of...

On Wed, Apr 8, 2020 at 4:05 AM Clint Wylie  wrote:

> Hey all,
>
> I put together a draft for the quarterly ASF board report due tomorrow,
> sorry for the short notice. Let me know if I missed anything or should make
> any changes. Thanks!
>
> -
>
> ## Description
>
> Apache Druid is a high performance real-time analytics database. It is
> designed for workflows where low-latency query and ingest are the main
> requirements. It implements ingestion, storage, and querying subsystems.
> Users interface with Druid through built-in SQL and JSON APIs, as well
> as third-party applications.
>
> Druid has an extensive web of connections with other Apache projects:
> Calcite for SQL planning, Curator and ZooKeeper for coordination, Kafka
> and Hadoop as data sources, Avro, ORC, or Parquet as supported data input
> formats, and DataSketches for scalable approximate algorithms. Druid
> can also be used as a data source by Superset.
>
> ## Issues
>
> There are no issues requiring board attention at this time.
>
> ## Activity
>
> We are currently finishing up our 2nd post-graduation release, 0.18.0,
> which we hope to have wrapped up and ready for release in the week of
> April 13th. Additionally, we released 0.17.1 on April 1st, in response
> to a vulnerability report received by the Apache Security Team. The
> corresponding CVE is CVE-2020-1958 and details are available at
> https://nvd.nist.gov/vuln/detail/CVE-2020-1958.
>
> To update on community happenings, since our last board report we
> have had 1 virtual meetup which was a success, with an additional
> virtual meetup scheduled for April 8th. Due to COVID-19, all in-person
> meetups have been put on hold, in favor of virtual meetups. Likewise,
> the Druid Summit event has been rescheduled for November 2-4, with a
> smaller virtual event scheduled for April 15th.
>
> Mailing list activity is healthy with 156 emails on the dev list
> (dev@druid.apache.org) over the last quarter. Our ASF slack channel,
> #druid, has nearly 750 members, with daily activity of users asking for
> and offering support to each other.
>
> ## Recent PMC changes
>
>  - Currently 27 PMC members.
>  - No changes to PMC since graduation.
>
> ## Recent committer changes
>
>  - Currently 35 committers.
>  - No recent changes to committers, the most recent addition was
>Chi Cao Minh on 2020-01-15
>
> ## Recent releases
>
>  - 0.17.1, a security release, was released on April 1 2020
>
> ## Development activity by the numbers
>
> In the last quarter:
>
>  - 317 pull requests opened
>  - 323 pull requests merged/closed
>  - 178 issues opened
>  - 112 issues closed
>  - 878 comments on pull requests
>  - 541 comments on issues
>


Re: Spark-based ingestion into Druid

2020-03-22 Thread itai yaffe
Hey everyone,
I created the initial design doc: 
https://docs.google.com/document/d/112VsrCKhtqtUTph5yXMzsaoxtz9wX1U2poi1vxuDswY/edit?usp=sharing
It lays out the motivation and a few more details (as discussed on the 
different channels).
Let’s start working on it together, and then we can get Gian’s review.

BTW - the doc is currently open for everyone to edit, let me know if you think 
I should change that.

On 2020/03/11 22:33:19, itai yaffe  wrote: 
> Hey Rajiv,
> Can you please provide some details on the use-case of querying Druid from
> Spark (e.g what type of queries, how big is the result set, and any other
> information you think is relevant)?
> 
> Thanks!
> 
> On Tue, Mar 10, 2020 at 6:08 PM Rajiv Mordani 
> wrote:
> 
> > As part of the requirements please include querying / reading from Spark
> > as well. This is a high priority for us.
> >
> > - Rajiv
> >
> > On 3/10/20, 1:26 AM, "Oguzhan Mangir" 
> > wrote:
> >
> > What we will do for that? I think, we can start to write requirements
> > and flows.
> >
> > On 2020/03/05 20:19:38, Julian Jaffe 
> > wrote:
> > > Yeah, I think the primary objective here is a standalone writer from
> > Spark
> > > to Druid.
> > >
> > > On Thu, Mar 5, 2020 at 11:43 AM itai yaffe 
> > wrote:
> > >
> > > > Thanks Julian!
> > > > I'm actually targeting for this connector to allow write
> > capabilities (at
> > > > least as a first phase), rather than focusing on read capabilities.
> > > > Having said that, I definitely see the value (even for the
> > use-cases in my
> > > > company) of having a reader that queries S3 segments directly!
> > Funny, we
> > > > too have implemented a mechanism (although a very simple one) to
> > get the
> > > > locations of the segments through SegmentMetadataQueries, to allow
> > > > batch-oriented queries to work against the deep storage :)
> > > >
> > > > Anyway, as I said, I think we can focus on write capabilities for
> > now, and
> > > > worry about read capabilities later (if that's OK).
> > > >
> > > > On 2020/03/05 18:29:09, Julian Jaffe  > >
> > > > wrote:
> > > > > The spark-druid-connector you shared brings up another design
> > decision we
> > > > > should probably talk through. That connector effectively wraps
> > an HTTP
> > > > > query client with Spark plumbing. An alternative approach (and
> > the one I
> > > > > ended up building due to our business requirements) is to build
> > a reader
> > > > > that operates directly over the S3 segments, shifting load for
> > what are
> > > > > likely very large and non-interactive queries off Druid-specific
> > hardware
> > > > > (with the exception of a few SegmentMetadataQueries to get
> > location
> > > > info).
> > > > >
> > > > > On Thu, Mar 5, 2020 at 8:04 AM itai yaffe 
> > wrote:
> > > > >
> > > > > > I'll let Julian answer, but in the meantime, I just wanted to
> > point
> > > > out we
> > > > > > might be able to draw some inspiration from this Spark-Redshift
> > > > connector (
> > > > > >
> > https://github.com/databricks/spark-redshift#scala
> > ).
> > > > > > Though it's somewhat outdated, it probably can be used as a
> > reference
> > > > for
> > > > > > this new Spark-Druid connector we're planning.
> > > > > > Another project to look at is
> > > > > >
> > https://github.com/SharpRay/spark-druid-connector
> > .
> > > > > >
> > > > > > On 2020/03/02 14:31:27, Oğuzhan Mangır <
> > 

Re: Spark-based ingestion into Druid

2020-03-11 Thread itai yaffe
Hey Rajiv,
Can you please provide some details on the use-case of querying Druid from
Spark (e.g. what type of queries, how big is the result set, and any other
information you think is relevant)?

Thanks!

On Tue, Mar 10, 2020 at 6:08 PM Rajiv Mordani 
wrote:

> As part of the requirements please include querying / reading from Spark
> as well. This is a high priority for us.
>
> - Rajiv
>
> On 3/10/20, 1:26 AM, "Oguzhan Mangir" 
> wrote:
>
> What we will do for that? I think, we can start to write requirements
> and flows.
>
> On 2020/03/05 20:19:38, Julian Jaffe 
> wrote:
> > Yeah, I think the primary objective here is a standalone writer from
> Spark
> > to Druid.
> >
> > On Thu, Mar 5, 2020 at 11:43 AM itai yaffe 
> wrote:
> >
> > > Thanks Julian!
> > > I'm actually targeting for this connector to allow write
> capabilities (at
> > > least as a first phase), rather than focusing on read capabilities.
> > > Having said that, I definitely see the value (even for the
> use-cases in my
> > > company) of having a reader that queries S3 segments directly!
> Funny, we
> > > too have implemented a mechanism (although a very simple one) to
> get the
> > > locations of the segments through SegmentMetadataQueries, to allow
> > > batch-oriented queries to work against the deep storage :)
> > >
> > > Anyway, as I said, I think we can focus on write capabilities for
> now, and
> > > worry about read capabilities later (if that's OK).
> > >
> > > On 2020/03/05 18:29:09, Julian Jaffe  >
> > > wrote:
> > > > The spark-druid-connector you shared brings up another design
> decision we
> > > > should probably talk through. That connector effectively wraps
> an HTTP
> > > > query client with Spark plumbing. An alternative approach (and
> the one I
> > > > ended up building due to our business requirements) is to build
> a reader
> > > > that operates directly over the S3 segments, shifting load for
> what are
> > > > likely very large and non-interactive queries off Druid-specific
> hardware
> > > > (with the exception of a few SegmentMetadataQueries to get
> location
> > > info).
> > > >
> > > > On Thu, Mar 5, 2020 at 8:04 AM itai yaffe 
> wrote:
> > > >
> > > > > I'll let Julian answer, but in the meantime, I just wanted to
> point
> > > out we
> > > > > might be able to draw some inspiration from this Spark-Redshift
> > > connector (
> > > > >
> https://github.com/databricks/spark-redshift#scala
> ).
> > > > > Though it's somewhat outdated, it probably can be used as a
> reference
> > > for
> > > > > this new Spark-Druid connector we're planning.
> > > > > Another project to look at is
> > > > >
> https://github.com/SharpRay/spark-druid-connector
> .
> > > > >
> > > > > On 2020/03/02 14:31:27, Oğuzhan Mangır <
> > > sosyalmedya.oguz...@gmail.com>
> > > > > wrote:
> > > > > > I think second option would be better. Many people use spark
> for
> > > batch
> > > > > operations with isolated clusters. Me and my friends will
> taking time
> > > for
> > > > > that. Julian, can you share your experiences for that? After
> that, we
> > > can
> > > > > write our aims, requirements and flows easily.
> > > > > >
> > > > > > On 2020/02/26 13:26:13, itai yaffe 
> wrote:
> > > > > > > Hey,
> > > > > > > Per Gian's proposal, and following this thread in Druid
> user group
> > > (
> > > > > > >
> https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM

Re: Spark-based ingestion into Druid

2020-03-05 Thread itai yaffe
Thanks Julian!
I'm actually aiming for this connector to provide write capabilities (at least 
as a first phase), rather than focusing on read capabilities.
Having said that, I definitely see the value (even for the use-cases in my 
company) of having a reader that queries S3 segments directly! Funny, we too 
have implemented a mechanism (although a very simple one) to get the locations 
of the segments through SegmentMetadataQueries, to allow batch-oriented queries 
to work directly against the deep storage :)

Anyway, as I said, I think we can focus on write capabilities for now, and 
worry about read capabilities later (if that's OK).

On 2020/03/05 18:29:09, Julian Jaffe  wrote: 
> The spark-druid-connector you shared brings up another design decision we
> should probably talk through. That connector effectively wraps an HTTP
> query client with Spark plumbing. An alternative approach (and the one I
> ended up building due to our business requirements) is to build a reader
> that operates directly over the S3 segments, shifting load for what are
> likely very large and non-interactive queries off Druid-specific hardware
> (with the exception of a few SegmentMetadataQueries to get location info).
> 
> On Thu, Mar 5, 2020 at 8:04 AM itai yaffe  wrote:
> 
> > I'll let Julian answer, but in the meantime, I just wanted to point out we
> > might be able to draw some inspiration from this Spark-Redshift connector (
> > https://github.com/databricks/spark-redshift#scala).
> > Though it's somewhat outdated, it probably can be used as a reference for
> > this new Spark-Druid connector we're planning.
> > Another project to look at is
> > https://github.com/SharpRay/spark-druid-connector.
> >
> > On 2020/03/02 14:31:27, O��uzhan Mang��r 
> > wrote:
> > > I think the second option would be better. Many people use Spark for batch
> > operations with isolated clusters. My friends and I will be taking time for
> > that. Julian, can you share your experiences with that? After that, we can
> > write our aims, requirements and flows easily.
> > >
> > > On 2020/02/26 13:26:13, itai yaffe  wrote:
> > > > Hey,
> > > > Per Gian's proposal, and following this thread in Druid user group (
> > > > https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and
> > this
> > > > thread in Druid Slack channel (
> > > > https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd
> > like
> > > > to start discussing the options of having Spark-based ingestion into
> > Druid.
> > > >
> > > > There's already an old project (
> > https://github.com/metamx/druid-spark-batch)
> > > > for that, so perhaps we can use that as a starting point.
> > > >
> > > > The thread on Slack suggested 2 approaches:
> > > >
> > > >1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
> > > >Spark batch job that ingests data into Druid, as a simple
> > replacement of
> > > >the Hadoop MapReduce ingestion task.
> > > >Meaning - your data pipeline will have a Spark job to pre-process
> > the
> > > >data (similar to what some of us have today), and another Spark job
> > to read
> > > >the output of the previous job, and create Druid segments (again -
> > > >following the same pattern as the Hadoop MapReduce ingestion task).
> > > >2. *Druid output sink for Spark* - rather than having 2 separate
> > Spark
> > > >jobs, 1 for pre-processing the data and 1 for ingesting the data
> > into
> > > >Druid, you'll have a single Spark job that pre-processes the data
> > and
> > > >creates Druid segments directly, e.g
> > sparkDataFrame.write.format("druid")
> > > >(as suggested by omngr on Slack).
> > > >
> > > >
> > > > I personally prefer the 2nd approach - while it might be harder to
> > > > implement, it seems the benefits are greater in this approach.
> > > >
> > > > I'd like to hear your thoughts and to start getting this ball rolling.
> > > >
> > > > Thanks,
> > > >Itai
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > > For additional commands, e-mail: dev-h...@druid.apache.org
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >
> >
> 

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



Re: Spark-based ingestion into Druid

2020-03-05 Thread itai yaffe
I'll let Julian answer, but in the meantime, I just wanted to point out we 
might be able to draw some inspiration from this Spark-Redshift connector 
(https://github.com/databricks/spark-redshift#scala).
Though it's somewhat outdated, it probably can be used as a reference for this 
new Spark-Druid connector we're planning.
Another project to look at is https://github.com/SharpRay/spark-druid-connector.

On 2020/03/02 14:31:27, Oğuzhan Mangır  wrote: 
> I think the second option would be better. Many people use Spark for batch 
> operations with isolated clusters. My friends and I will be taking time for 
> that. Julian, can you share your experiences with that? After that, we can 
> write our aims, requirements, and flows easily. 
> 
> On 2020/02/26 13:26:13, itai yaffe  wrote: 
> > Hey,
> > Per Gian's proposal, and following this thread in Druid user group (
> > https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> > thread in Druid Slack channel (
> > https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> > to start discussing the options of having Spark-based ingestion into Druid.
> > 
> > There's already an old project (https://github.com/metamx/druid-spark-batch)
> > for that, so perhaps we can use that as a starting point.
> > 
> > The thread on Slack suggested 2 approaches:
> > 
> >1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
> >Spark batch job that ingests data into Druid, as a simple replacement of
> >the Hadoop MapReduce ingestion task.
> >Meaning - your data pipeline will have a Spark job to pre-process the
> >data (similar to what some of us have today), and another Spark job to 
> > read
> >the output of the previous job, and create Druid segments (again -
> >following the same pattern as the Hadoop MapReduce ingestion task).
> >2. *Druid output sink for Spark* - rather than having 2 separate Spark
> >jobs, 1 for pre-processing the data and 1 for ingesting the data into
> >Druid, you'll have a single Spark job that pre-processes the data and
> >creates Druid segments directly, e.g sparkDataFrame.write.format("druid")
> >(as suggested by omngr on Slack).
> > 
> > 
> > I personally prefer the 2nd approach - while it might be harder to
> > implement, it seems the benefits are greater in this approach.
> > 
> > I'd like to hear your thoughts and to start getting this ball rolling.
> > 
> > Thanks,
> >Itai
> > 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> For additional commands, e-mail: dev-h...@druid.apache.org
> 
> 

-
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org



Spark-based ingestion into Druid

2020-02-26 Thread itai yaffe
Hey,
Per Gian's proposal, and following this thread in Druid user group (
https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
thread in Druid Slack channel (
https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
to start discussing the options of having Spark-based ingestion into Druid.

There's already an old project (https://github.com/metamx/druid-spark-batch)
for that, so perhaps we can use that as a starting point.

The thread on Slack suggested 2 approaches:

   1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
   Spark batch job that ingests data into Druid, as a simple replacement of
   the Hadoop MapReduce ingestion task.
   Meaning - your data pipeline will have a Spark job to pre-process the
   data (similar to what some of us have today), and another Spark job to read
   the output of the previous job, and create Druid segments (again -
   following the same pattern as the Hadoop MapReduce ingestion task).
   2. *Druid output sink for Spark* - rather than having 2 separate Spark
   jobs, 1 for pre-processing the data and 1 for ingesting the data into
   Druid, you'll have a single Spark job that pre-processes the data and
   creates Druid segments directly, e.g. sparkDataFrame.write.format("druid")
   (as suggested by omngr on Slack) - see the rough sketch below.
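
To make the 2nd approach a bit more concrete, the user-facing API could look
roughly like the sketch below. Everything here is hypothetical: the "druid"
format name, the option names, and the paths are placeholders for whatever
the connector would eventually expose:

    import org.apache.spark.sql.SparkSession

    object DruidOutputSinkExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("druid-output-sink-example")
          .getOrCreate()

        // Regular Spark pre-processing, just like in the 1st approach...
        val events = spark.read
          .parquet("s3://my-bucket/raw/events/")
          .filter("eventType = 'click'")

        // ...followed by a direct write into Druid: the connector would build
        // the segments, push them to deep storage, and update Druid's metadata.
        events.write
          .format("druid")                // hypothetical format name
          .option("dataSource", "events") // hypothetical option names
          .option("timeColumn", "__time")
          .option("segmentGranularity", "DAY")
          .mode("overwrite")
          .save()
      }
    }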


I personally prefer the 2nd approach - while it might be harder to
implement, its benefits seem greater.

I'd like to hear your thoughts and to start getting this ball rolling.

Thanks,
   Itai