GitHub proposal: https://github.com/apache/druid/issues/9780. I'll send a separate email to the dev list in the morning as well.
On Thu, Apr 2, 2020 at 11:04 AM Julian Jaffe <jja...@pinterest.com> wrote:

> I had a few hours last night, so I worked up a rough cut of a Spark reader
> <https://github.com/JulianJaffePinterest/druid/tree/spark_druid_connector>
> to help think through some design decisions. I've only run it locally and
> haven't attempted to hook it up to a real cluster or metadata instance yet,
> but in keeping with the Apache way I figured I'd share it now instead of
> waiting until I'd fully built it and then asking for feedback. Hopefully
> this can help us all reach consensus on what these connectors should look
> like.
>
> On Sun, Mar 22, 2020 at 3:19 AM itai yaffe <itai.ya...@gmail.com> wrote:
>
>> Hey everyone,
>> I created the initial design doc:
>> https://docs.google.com/document/d/112VsrCKhtqtUTph5yXMzsaoxtz9wX1U2poi1vxuDswY/edit?usp=sharing
>> It lays out the motivation and a few more details (as discussed on the
>> different channels).
>> Let's start working on it together, and then we can get Gian's review.
>>
>> BTW - the doc is currently open for everyone to edit; let me know if you
>> think I should change that.
>>
>> On 2020/03/11 22:33:19, itai yaffe <itai.ya...@gmail.com> wrote:
>>> Hey Rajiv,
>>> Can you please provide some details on the use case of querying Druid
>>> from Spark (e.g. what type of queries, how big the result set is, and
>>> any other information you think is relevant)?
>>>
>>> Thanks!
>>>
>>> On Tue, Mar 10, 2020 at 6:08 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
>>>
>>>> As part of the requirements, please include querying/reading from Spark
>>>> as well. This is a high priority for us.
>>>>
>>>> - Rajiv
>>>>
>>>> On 3/10/20, 1:26 AM, "Oguzhan Mangir" <sosyalmedya.oguz...@gmail.com> wrote:
>>>>
>>>>> What should we do for that? I think we can start to write requirements
>>>>> and flows.
>>>>>
>>>>> On 2020/03/05 20:19:38, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
>>>>>> Yeah, I think the primary objective here is a standalone writer from
>>>>>> Spark to Druid.
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 11:43 AM itai yaffe <itai.ya...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Julian!
>>>>>>> I'm actually targeting for this connector to provide write
>>>>>>> capabilities (at least as a first phase), rather than focusing on
>>>>>>> read capabilities.
>>>>>>> Having said that, I definitely see the value (even for the use cases
>>>>>>> in my company) of having a reader that queries S3 segments directly!
>>>>>>> Funny, we too have implemented a mechanism (although a very simple
>>>>>>> one) to get the locations of the segments through
>>>>>>> SegmentMetadataQueries, to allow batch-oriented queries to work
>>>>>>> against the deep storage :)
>>>>>>>
>>>>>>> Anyway, as I said, I think we can focus on write capabilities for
>>>>>>> now, and worry about read capabilities later (if that's OK).
>>>>>>>
>>>>>>> On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
>>>>>>>> The spark-druid-connector you shared brings up another design
>>>>>>>> decision we should probably talk through. That connector effectively
>>>>>>>> wraps an HTTP query client with Spark plumbing. An alternative
>>>>>>>> approach (and the one I ended up building due to our business
>>>>>>>> requirements) is to build a reader that operates directly over the
>>>>>>>> S3 segments, shifting load for what are likely very large and
>>>>>>>> non-interactive queries off Druid-specific hardware (with the
>>>>>>>> exception of a few SegmentMetadataQueries to get location info).
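[Editor's note: for context, a SegmentMetadataQuery is a native Druid query type; a minimal request body looks roughly like the sketch below. The datasource name and interval are illustrative only.]

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "wikipedia",
  "intervals": ["2020-01-01/2020-02-01"]
}
```

POSTed to a Broker at /druid/v2/, it returns per-segment metadata (segment ids, intervals, column info), which a reader can use to locate the segments it needs to scan.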
>>>>>>>>
>>>>>>>> On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'll let Julian answer, but in the meantime, I just wanted to point
>>>>>>>>> out we might be able to draw some inspiration from this
>>>>>>>>> Spark-Redshift connector
>>>>>>>>> (https://github.com/databricks/spark-redshift#scala).
>>>>>>>>> Though it's somewhat outdated, it can probably be used as a
>>>>>>>>> reference for this new Spark-Druid connector we're planning.
>>>>>>>>> Another project to look at is
>>>>>>>>> https://github.com/SharpRay/spark-druid-connector.
>>>>>>>>>
>>>>>>>>> On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote:
>>>>>>>>>> I think the second option would be better. Many people use Spark
>>>>>>>>>> for batch operations with isolated clusters. My friends and I will
>>>>>>>>>> set aside time for that. Julian, can you share your experience
>>>>>>>>>> with that? After that, we can easily write up our aims,
>>>>>>>>>> requirements, and flows.
>>>>>>>>>>
>>>>>>>>>> On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>> Per Gian's proposal, and following this thread in the Druid user
>>>>>>>>>>> group
>>>>>>>>>>> (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM)
>>>>>>>>>>> and this thread in the Druid Slack channel
>>>>>>>>>>> (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600),
>>>>>>>>>>> I'd like to start discussing the options for having Spark-based
>>>>>>>>>>> ingestion into Druid.
>>>>>>>>>>>
>>>>>>>>>>> There's already an old project
>>>>>>>>>>> (https://github.com/metamx/druid-spark-batch) for that, so
>>>>>>>>>>> perhaps we can use it as a starting point.
>>>>>>>>>>>
>>>>>>>>>>> The thread on Slack suggested 2 approaches:
>>>>>>>>>>>
>>>>>>>>>>> 1.
*Simply replacing the Hadoop MapReduce ingestion task* - having a
>>>>>>>>>>> Spark batch job that ingests data into Druid, as a simple
>>>>>>>>>>> replacement for the Hadoop MapReduce ingestion task.
>>>>>>>>>>> Meaning - your data pipeline will have one Spark job to
>>>>>>>>>>> pre-process the data (similar to what some of us have today),
>>>>>>>>>>> and another Spark job to read the output of the previous job
>>>>>>>>>>> and create Druid segments (again - following the same pattern
>>>>>>>>>>> as the Hadoop MapReduce ingestion task).
>>>>>>>>>>> 2. *Druid output sink for Spark* - rather than having 2 separate
>>>>>>>>>>> Spark jobs, 1 for pre-processing the data and 1 for ingesting
>>>>>>>>>>> the data into Druid, you'd have a single Spark job that
>>>>>>>>>>> pre-processes the data and creates Druid segments directly, e.g.
>>>>>>>>>>> sparkDataFrame.write.format("druid") (as suggested by omngr on
>>>>>>>>>>> Slack).
>>>>>>>>>>>
>>>>>>>>>>> I personally prefer the 2nd approach - while it might be harder
>>>>>>>>>>> to implement, its benefits seem greater.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to hear your thoughts and to start getting this ball
>>>>>>>>>>> rolling.
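[Editor's note: a rough sketch of what approach 2 could look like from the user's side. The "druid" format registration and all option names below are hypothetical - no such connector exists at the time of this thread - so this is purely illustrative of the proposed API shape, not a working example.]

```scala
// Hypothetical Druid output sink for Spark (approach 2 above).
// The "druid" format and every option name here are assumptions for
// illustration; the actual connector would define its own.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("druid-ingest").getOrCreate()

// Pre-processing and ingestion in one job: read raw input, transform it,
// then write Druid segments directly instead of handing off to a second job.
val events = spark.read.parquet("s3://bucket/raw-events/")  // illustrative path

events.write
  .format("druid")                     // hypothetical data source registration
  .option("dataSource", "events")      // target Druid datasource (assumed option)
  .option("timestampColumn", "ts")     // column to map to __time (assumed option)
  .option("segmentGranularity", "DAY") // segment interval size (assumed option)
  .mode(SaveMode.Overwrite)
  .save()
```

Under Spark's DataSource API, registering a "druid" format like this would let the single job own both the transformation and the segment creation/publication, which is the appeal of approach 2.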
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Itai

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org