What should we do for that? I think we can start by writing the requirements and flows.
On 2020/03/05 20:19:38, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
> Yeah, I think the primary objective here is a standalone writer from Spark
> to Druid.
>
> On Thu, Mar 5, 2020 at 11:43 AM itai yaffe <itai.ya...@gmail.com> wrote:
> >
> > Thanks Julian!
> > I'm actually aiming for this connector to provide write capabilities (at
> > least as a first phase), rather than focusing on read capabilities.
> > Having said that, I definitely see the value (even for the use-cases in
> > my company) of having a reader that queries S3 segments directly! Funny,
> > we too have implemented a mechanism (although a very simple one) to get
> > the locations of the segments through SegmentMetadataQueries, to allow
> > batch-oriented queries to work directly against the deep storage :)
> >
> > Anyway, as I said, I think we can focus on write capabilities for now,
> > and worry about read capabilities later (if that's OK).
> >
> > On 2020/03/05 18:29:09, Julian Jaffe <jja...@pinterest.com.INVALID> wrote:
> > > The spark-druid-connector you shared brings up another design decision
> > > we should probably talk through. That connector effectively wraps an
> > > HTTP query client with Spark plumbing. An alternative approach (and the
> > > one I ended up building due to our business requirements) is to build a
> > > reader that operates directly over the S3 segments, shifting load for
> > > what are likely very large and non-interactive queries off
> > > Druid-specific hardware (with the exception of a few
> > > SegmentMetadataQueries to get location info).
> > >
> > > On Thu, Mar 5, 2020 at 8:04 AM itai yaffe <itai.ya...@gmail.com> wrote:
> > > >
> > > > I'll let Julian answer, but in the meantime, I just wanted to point
> > > > out we might be able to draw some inspiration from this
> > > > Spark-Redshift connector
> > > > (https://github.com/databricks/spark-redshift#scala).
> > > > Though it's somewhat outdated, it can probably be used as a reference
> > > > for this new Spark-Druid connector we're planning.
> > > > Another project to look at is
> > > > https://github.com/SharpRay/spark-druid-connector.
> > > >
> > > > On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote:
> > > > > I think the second option would be better. Many people use Spark
> > > > > for batch operations with isolated clusters. My friends and I will
> > > > > set aside time for that. Julian, can you share your experiences
> > > > > with that? After that, we can write our aims, requirements and
> > > > > flows easily.
> > > > >
> > > > > On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:
> > > > > > Hey,
> > > > > > Per Gian's proposal, and following this thread in the Druid user
> > > > > > group (https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM)
> > > > > > and this thread in the Druid Slack channel
> > > > > > (https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600),
> > > > > > I'd like to start discussing the options for having Spark-based
> > > > > > ingestion into Druid.
> > > > > >
> > > > > > There's already an old project
> > > > > > (https://github.com/metamx/druid-spark-batch) for that, so
> > > > > > perhaps we can use that as a starting point.
> > > > > >
> > > > > > The thread on Slack suggested 2 approaches:
> > > > > >
> > > > > > 1. *Simply replacing the Hadoop MapReduce ingestion task* -
> > > > > > having a Spark batch job that ingests data into Druid, as a
> > > > > > simple replacement of the Hadoop MapReduce ingestion task.
> > > > > > Meaning - your data pipeline will have a Spark job to pre-process
> > > > > > the data (similar to what some of us have today), and another
> > > > > > Spark job to read the output of the previous job and create
> > > > > > Druid segments (again - following the same pattern as the Hadoop
> > > > > > MapReduce ingestion task).
> > > > > > 2. *Druid output sink for Spark* - rather than having 2 separate
> > > > > > Spark jobs, 1 for pre-processing the data and 1 for ingesting
> > > > > > the data into Druid, you'll have a single Spark job that
> > > > > > pre-processes the data and creates Druid segments directly, e.g.
> > > > > > sparkDataFrame.write.format("druid") (as suggested by omngr on
> > > > > > Slack).
> > > > > >
> > > > > > I personally prefer the 2nd approach - while it might be harder
> > > > > > to implement, it seems the benefits are greater in this approach.
> > > > > >
> > > > > > I'd like to hear your thoughts and to start getting this ball
> > > > > > rolling.
> > > > > >
> > > > > > Thanks,
> > > > > > Itai
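To make the 2nd approach concrete, here is a minimal sketch of what the single-job flow could look like from a user's perspective. Everything below is hypothetical: the "druid" format name, its options, and the input path are placeholders for an API that hasn't been designed yet.

    // Sketch of approach 2: one Spark job that pre-processes raw data and
    // writes Druid segments directly through a (hypothetical) "druid" sink.
    // The format name, options, and paths below are all placeholders.
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("spark-druid-sink-sketch")
      .getOrCreate()

    // Pre-processing step - stands in for whatever ETL runs today.
    val events = spark.read
      .parquet("s3://example-bucket/raw/events/")  // placeholder input path
      .filter(col("timestamp").isNotNull)

    // Hypothetical Druid output sink, per sparkDataFrame.write.format("druid"):
    events.write
      .format("druid")                       // hypothetical format name
      .option("dataSource", "events")        // target Druid datasource
      .option("timestampColumn", "timestamp")
      .option("segmentGranularity", "DAY")
      .mode(SaveMode.Overwrite)
      .save()

Compared to the 1st approach, this removes the hand-off between the pre-processing job and a separate ingestion task - the segments are built inside the same Spark application that cleans the data.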
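And for the reader approach Julian describes, a rough sketch of the first step - asking the broker which segments cover an interval via a segmentMetadata query (a native Druid query type), before scanning the segment files in deep storage directly. The broker URL, datasource, and interval are placeholders:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import scala.io.Source

    // segmentMetadata is Druid's native metadata query; it returns the ids,
    // intervals, and sizes of the segments covering the queried interval.
    val query =
      """{
        |  "queryType": "segmentMetadata",
        |  "dataSource": "events",
        |  "intervals": ["2020-01-01/2020-02-01"]
        |}""".stripMargin

    val url = new URL("http://broker.example.com:8082/druid/v2/")  // placeholder broker
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(query.getBytes(StandardCharsets.UTF_8))

    // A reader would map the returned segment ids to their deep-storage
    // locations and scan those files with Spark, bypassing the historicals.
    println(Source.fromInputStream(conn.getInputStream, "UTF-8").mkString)
    conn.disconnect()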