As part of the requirements, please include querying / reading from Spark as well. This is a high priority for us.
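
[For illustration only - a minimal sketch of what the read path could look like from the Spark side. The "druid" format and every option name below are hypothetical placeholders for an API that does not exist yet, not a reference to any current connector:]

    import org.apache.spark.sql.SparkSession

    object DruidReadSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("druid-read-sketch").getOrCreate()

        // Hypothetical read path: the "druid" format and the option names
        // ("broker", "dataSource", "interval") are placeholders only.
        val df = spark.read
          .format("druid")
          .option("broker", "druid-broker:8082")
          .option("dataSource", "wikipedia")
          .option("interval", "2020-01-01/2020-02-01")
          .load()

        df.printSchema()
        df.show(10)
      }
    }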

- Rajiv

On 3/10/20, 1:26 AM, "Oguzhan Mangir" <sosyalmedya.oguz...@gmail.com> wrote:

    What should we do for that? I think we can start writing the requirements and flows.

    On 2020/03/05 20:19:38, Julian Jaffe <jja...@pinterest.com.INVALID> wrote: 
    > Yeah, I think the primary objective here is a standalone writer from Spark
    > to Druid.
    > 
    > On Thu, Mar 5, 2020 at 11:43 AM itai yaffe <itai.ya...@gmail.com> wrote:
    > 
    > > Thanks Julian!
    > > I'm actually aiming for this connector to provide write capabilities (at
    > > least as a first phase), rather than focusing on read capabilities.
    > > Having said that, I definitely see the value (even for the use-cases in my
    > > company) of having a reader that queries S3 segments directly! Funny, we
    > > too have implemented a mechanism (although a very simple one) to get the
    > > locations of the segments through SegmentMetadataQueries, to allow
    > > batch-oriented queries to work against the deep storage :)
    > >
    > > Anyway, as I said, I think we can focus on write capabilities for now, and
    > > worry about read capabilities later (if that's OK).
    > >
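
    [For illustration only - a minimal sketch of issuing a segmentMetadata query from plain Scala (using Java 11's HttpClient), with a placeholder Broker address and datasource. How the returned segment metadata is then mapped to deep-storage locations, per the mechanism described above, is not shown here.]

        import java.net.URI
        import java.net.http.{HttpClient, HttpRequest, HttpResponse}

        object SegmentMetadataSketch {
          def main(args: Array[String]): Unit = {
            // Placeholder Broker address; Druid's native query endpoint is /druid/v2/.
            val brokerUrl = "http://druid-broker:8082/druid/v2/"

            // A standard segmentMetadata query body; datasource and interval are placeholders.
            val query =
              """{
                |  "queryType": "segmentMetadata",
                |  "dataSource": "wikipedia",
                |  "intervals": ["2020-01-01/2020-02-01"]
                |}""".stripMargin

            val request = HttpRequest.newBuilder()
              .uri(URI.create(brokerUrl))
              .header("Content-Type", "application/json")
              .POST(HttpRequest.BodyPublishers.ofString(query))
              .build()

            // The response lists per-segment metadata (segment id, interval,
            // columns, size), which the mechanism above uses to locate segments.
            val response = HttpClient.newHttpClient()
              .send(request, HttpResponse.BodyHandlers.ofString())
            println(response.body())
          }
        }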
    > > On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID>
    > > wrote:
    > > > The spark-druid-connector you shared brings up another design decision
    > > > we should probably talk through. That connector effectively wraps an
    > > > HTTP query client with Spark plumbing. An alternative approach (and the
    > > > one I ended up building due to our business requirements) is to build a
    > > > reader that operates directly over the S3 segments, shifting load for
    > > > what are likely very large and non-interactive queries off
    > > > Druid-specific hardware (with the exception of a few
    > > > SegmentMetadataQueries to get location info).
    > > >
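
    [To make the contrast concrete - a rough sketch of the "HTTP query client wrapped in Spark plumbing" style described above. The Broker address, datasource and intervals are placeholders, and this is not how the linked spark-druid-connector is actually implemented. Note that every row still flows through Druid processes, which is the load the direct-over-S3 reader avoids:]

        import java.net.URI
        import java.net.http.{HttpClient, HttpRequest, HttpResponse}
        import org.apache.spark.sql.SparkSession

        object DruidHttpReadSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("druid-http-read-sketch").getOrCreate()

            // Split the overall time range into one slice per Spark partition.
            val intervals = Seq("2020-01-01/2020-01-16", "2020-01-16/2020-02-01")

            val rawResults = spark.sparkContext.parallelize(intervals, intervals.size).map { interval =>
              // Each task posts a Druid scan query for its slice to the Broker.
              val query =
                s"""{"queryType": "scan", "dataSource": "wikipedia",
                   | "intervals": ["$interval"], "resultFormat": "compactedList"}""".stripMargin
              val request = HttpRequest.newBuilder()
                .uri(URI.create("http://druid-broker:8082/druid/v2/"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build()
              HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body()
            }

            rawResults.collect().foreach(println)
          }
        }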
    > > > On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:
    > > >
    > > > > I'll let Julian answer, but in the meantime, I just wanted to point
    > > > > out we might be able to draw some inspiration from this Spark-Redshift
    > > > > connector (https://github.com/databricks/spark-redshift#scala).
    > > > > Though it's somewhat outdated, it probably can be used as a reference
    > > > > for this new Spark-Druid connector we're planning.
    > > > > Another project to look at is
    > > > > https://github.com/SharpRay/spark-druid-connector.
    > > > >
    > > > > On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com>
    > > > > wrote:
    > > > > > I think the second option would be better. Many people use Spark for
    > > > > > batch operations with isolated clusters. My friends and I will take
    > > > > > some time for that. Julian, can you share your experiences with that?
    > > > > > After that, we can write up our aims, requirements and flows easily.
    > > > > >
    > > > > > On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:
    > > > > > > Hey,
    > > > > > > Per Gian's proposal, and following this thread in the Druid user
    > > > > > > group (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM)
    > > > > > > and this thread in the Druid Slack channel
    > > > > > > (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600),
    > > > > > > I'd like to start discussing the options for having Spark-based
    > > > > > > ingestion into Druid.
    > > > > > >
    > > > > > > There's already an old project
    > > > > > > (https://github.com/metamx/druid-spark-batch) for that, so perhaps
    > > > > > > we can use that as a starting point.
    > > > > > >
    > > > > > > The thread on Slack suggested 2 approaches:
    > > > > > >
    > > > > > >    1. *Simply replacing the Hadoop MapReduce ingestion task* -
    > > > > > >    having a Spark batch job that ingests data into Druid, as a
    > > > > > >    simple replacement of the Hadoop MapReduce ingestion task.
    > > > > > >    Meaning - your data pipeline will have a Spark job to
    > > > > > >    pre-process the data (similar to what some of us have today),
    > > > > > >    and another Spark job to read the output of the previous job
    > > > > > >    and create Druid segments (again - following the same pattern
    > > > > > >    as the Hadoop MapReduce ingestion task).
    > > > > > >    2. *Druid output sink for Spark* - rather than having 2
    > > > > > >    separate Spark jobs, 1 for pre-processing the data and 1 for
    > > > > > >    ingesting the data into Druid, you'll have a single Spark job
    > > > > > >    that pre-processes the data and creates Druid segments
    > > > > > >    directly, e.g. sparkDataFrame.write.format("druid")
    > > > > > >    (as suggested by omngr on Slack; a sketch follows this list).
    > > > > > >
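
    [A sketch of how option 2 might look to a pipeline author. The "druid" format, the option names and the S3 path are all placeholders for an API that is only being proposed here:]

        import org.apache.spark.sql.{SaveMode, SparkSession}

        object DruidWriteSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("druid-write-sketch").getOrCreate()

            // Pre-processing step - stands in for whatever the pipeline already does.
            val events = spark.read.json("s3://my-bucket/raw-events/2020-01-01/")
              .filter("eventType = 'pageView'")

            // Single job: the same DataFrame is handed straight to the proposed
            // Druid sink instead of being written out for a second ingestion job.
            events.write
              .format("druid")
              .option("dataSource", "pageViews")
              .option("timestampColumn", "eventTime")
              .option("segmentGranularity", "DAY")
              .mode(SaveMode.Overwrite)
              .save()
          }
        }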
    > > > > > >
    > > > > > > I personally prefer the 2nd approach - while it might be harder to
    > > > > > > implement, it seems the benefits are greater in this approach.
    > > > > > >
    > > > > > > I'd like to hear your thoughts and to start getting this ball
    > > > > > > rolling.
    > > > > > >
    > > > > > > Thanks,
    > > > > > >            Itai
    > > > > > >
    > > > > >
    > >
    > 

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
    For additional commands, e-mail: dev-h...@druid.apache.org

