I had a few hours last night, so I worked up a rough cut of a Spark reader
<https://github.com/JulianJaffePinterest/druid/tree/spark_druid_connector>
to help think through some design decisions. I've only run it locally and
haven't attempted to hook it up to a real cluster or metadata instance yet,
but in keeping with the Apache way I figured I'd share it now instead of
waiting until I'd fully built it and then asking for feedback. Hopefully
this can help us all reach consensus on what these connectors should look
like.
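
For concreteness, here's a rough sketch of the usage I have in mind. The
format name and option keys below are illustrative placeholders rather than
the branch's confirmed API:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("druid-read-sketch").getOrCreate()

    // Hypothetical usage: read a Druid datasource as a DataFrame. The short
    // format name "druid" and the option keys are placeholders.
    val df = spark.read
      .format("druid")
      .option("dataSource", "wikipedia")  // Druid datasource to read
      .option("metadataUri", "jdbc:...")  // metadata store used for segment discovery
      .load()

    df.printSchema()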

On Sun, Mar 22, 2020 at 3:19 AM itai yaffe <itai.ya...@gmail.com> wrote:

> Hey everyone,
> I created the initial design doc:
> https://docs.google.com/document/d/112VsrCKhtqtUTph5yXMzsaoxtz9wX1U2poi1vxuDswY/edit?usp=sharing
> It lays out the motivation and a few more details (as discussed on the
> different channels).
> Let’s start working on it together, and then we can get Gian’s review.
>
> BTW - the doc is currently open for everyone to edit, let me know if you
> think I should change that.
>
> On 2020/03/11 22:33:19, itai yaffe <itai.ya...@gmail.com> wrote:
> > Hey Rajiv,
> > Can you please provide some details on the use-case of querying Druid from
> > Spark (e.g. what type of queries, how big the result set is, and any other
> > information you think is relevant)?
> >
> > Thanks!
> >
> > On Tue, Mar 10, 2020 at 6:08 PM Rajiv Mordani <rmord...@vmware.com.invalid>
> > wrote:
> >
> > > As part of the requirements, please include querying / reading from Spark
> > > as well. This is a high priority for us.
> > >
> > > - Rajiv
> > >
> > > On 3/10/20, 1:26 AM, "Oguzhan Mangir" <sosyalmedya.oguz...@gmail.com>
> > > wrote:
> > >
> > > > What should we do next? I think we can start writing the requirements
> > > > and flows.
> > > >
> > > > On 2020/03/05 20:19:38, Julian Jaffe <jja...@pinterest.com.INVALID>
> > > > wrote:
> > > > > Yeah, I think the primary objective here is a standalone writer from
> > > > > Spark to Druid.
> > > > >
> > > > > On Thu, Mar 5, 2020 at 11:43 AM itai yaffe <itai.ya...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks Julian!
> > > > > > I'm actually aiming for this connector to provide write capabilities
> > > > > > (at least as a first phase), rather than focusing on read
> > > > > > capabilities.
> > > > > > Having said that, I definitely see the value (even for the use-cases
> > > > > > in my company) of having a reader that queries S3 segments directly!
> > > > > > Funny, we too have implemented a mechanism (although a very simple
> > > > > > one) to get the locations of the segments through
> > > > > > SegmentMetadataQueries, to allow batch-oriented queries to work
> > > > > > directly against the deep storage :)
> > > > > >
> > > > > > Anyway, as I said, I think we can focus on write capabilities for
> > > > > > now, and worry about read capabilities later (if that's OK).
> > > > > >
> > > > > > On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID>
> > > > > > wrote:
> > > > > > > The spark-druid-connector you shared brings up another design
> > > > > > > decision we should probably talk through. That connector
> > > > > > > effectively wraps an HTTP query client with Spark plumbing. An
> > > > > > > alternative approach (and the one I ended up building due to
> > > > > > > our business requirements) is to build a reader that operates
> > > > > > > directly over the S3 segments, shifting load for what are
> > > > > > > likely very large and non-interactive queries off
> > > > > > > Druid-specific hardware (with the exception of a few
> > > > > > > SegmentMetadataQueries to get location info).
> > > > > > >
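> > > > > > > For reference, a minimal segmentMetadata query body might look
> > > > > > > like the sketch below (datasource and interval are
> > > > > > > placeholders), POSTed to the Broker's /druid/v2 endpoint. The
> > > > > > > response carries segment ids, intervals, and sizes; resolving
> > > > > > > those ids to deep-storage paths is the extra step each of us
> > > > > > > had to build ourselves:
> > > > > > >
> > > > > > >     {
> > > > > > >       "queryType": "segmentMetadata",
> > > > > > >       "dataSource": "wikipedia",
> > > > > > >       "intervals": ["2020-01-01/2020-02-01"]
> > > > > > >     }
> > > > > > >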
> > > > > > > On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I'll let Julian answer, but in the meantime, I just wanted
> > > > > > > > to point out we might be able to draw some inspiration from
> > > > > > > > this Spark-Redshift connector
> > > > > > > > (https://github.com/databricks/spark-redshift#scala).
> > > > > > > > Though it's somewhat outdated, it can probably be used as a
> > > > > > > > reference for this new Spark-Druid connector we're planning.
> > > > > > > > Another project to look at is
> > > > > > > > https://github.com/SharpRay/spark-druid-connector.
> > > > > > > >
> > > > > > > > On 2020/03/02 14:31:27, Oğuzhan Mangır
> > > > > > > > <sosyalmedya.oguz...@gmail.com> wrote:
> > > > > > > > > I think the second option would be better. Many people use
> > > > > > > > > Spark for batch operations with isolated clusters. My
> > > > > > > > > friends and I will take some time to work on that. Julian,
> > > > > > > > > can you share your experience with that? After that, we can
> > > > > > > > > easily write down our aims, requirements, and flows.
> > > > > > > > >
> > > > > > > > > On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > Hey,
> > > > > > > > > > Per Gian's proposal, and following this thread in the Druid
> > > > > > > > > > user group
> > > > > > > > > > (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM)
> > > > > > > > > > and this thread in the Druid Slack channel
> > > > > > > > > > (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600),
> > > > > > > > > > I'd like to start discussing the options for having
> > > > > > > > > > Spark-based ingestion into Druid.
> > > > > > > > > >
> > > > > > > > > > There's already an old project
> > > > > > > > > > (https://github.com/metamx/druid-spark-batch) for that, so
> > > > > > > > > > perhaps we can use it as a starting point.
> > > > > > > > > >
> > > > > > > > > > The thread on Slack suggested 2 approaches:
> > > > > > > > > >
> > > > > > > > > >    1. *Simply replacing the Hadoop MapReduce ingestion
> > > > > > > > > >    task* - having a Spark batch job that ingests data into
> > > > > > > > > >    Druid, as a simple replacement of the Hadoop MapReduce
> > > > > > > > > >    ingestion task.
> > > > > > > > > >    Meaning - your data pipeline will have a Spark job to
> > > > > > > > > >    pre-process the data (similar to what some of us have
> > > > > > > > > >    today), and another Spark job to read the output of the
> > > > > > > > > >    previous job and create Druid segments (again - following
> > > > > > > > > >    the same pattern as the Hadoop MapReduce ingestion task).
> > > > > > > > > >    2. *Druid output sink for Spark* - rather than having 2
> > > > > > > > > >    separate Spark jobs, 1 for pre-processing the data and 1
> > > > > > > > > >    for ingesting the data into Druid, you'll have a single
> > > > > > > > > >    Spark job that pre-processes the data and creates Druid
> > > > > > > > > >    segments directly, e.g.
> > > > > > > > > >    sparkDataFrame.write.format("druid") (as suggested by
> > > > > > > > > >    omngr on Slack; see the sketch below).
> > > > > > > > > >
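> > > > > > > > > > To make the 2nd approach concrete, here's a minimal sketch
> > > > > > > > > > of what the user-facing API might look like. Everything
> > > > > > > > > > here (format name, option keys, column names) is
> > > > > > > > > > hypothetical, since no such sink exists yet:
> > > > > > > > > >
> > > > > > > > > >     import org.apache.spark.sql.{SaveMode, SparkSession}
> > > > > > > > > >
> > > > > > > > > >     val spark = SparkSession.builder().appName("druid-ingest-sketch").getOrCreate()
> > > > > > > > > >
> > > > > > > > > >     // Pre-process the raw data as usual (hypothetical input path)...
> > > > > > > > > >     val df = spark.read.parquet("s3://example-bucket/events/")
> > > > > > > > > >       .filter("country = 'US'")
> > > > > > > > > >
> > > > > > > > > >     // ...then create and publish Druid segments directly. All option
> > > > > > > > > >     // keys below are a hypothetical surface, not an existing API.
> > > > > > > > > >     df.write
> > > > > > > > > >       .format("druid")
> > > > > > > > > >       .mode(SaveMode.Overwrite)
> > > > > > > > > >       .option("dataSource", "events")      // target Druid datasource
> > > > > > > > > >       .option("timestampColumn", "ts")     // column to use as __time
> > > > > > > > > >       .option("segmentGranularity", "DAY") // segment granularity
> > > > > > > > > >       .option("deepStorageType", "s3")     // where segments get pushed
> > > > > > > > > >       .option("metadataUri", "jdbc:...")   // metadata store to publish to
> > > > > > > > > >       .save()
> > > > > > > > > >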
> > > > > > > > > > I personally prefer the 2nd approach - while it might be
> > > > > > > > > > harder to implement, its benefits seem greater.
> > > > > > > > > >
> > > > > > > > > > I'd like to hear your thoughts and to start getting this
> > > > > > > > > > ball rolling.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >            Itai