I'll let Julian answer, but in the meantime, I just wanted to point out we 
might be able to draw some inspiration from this Spark-Redshift connector 
(https://github.com/databricks/spark-redshift#scala).
Though it's somewhat outdated, it can probably serve as a reference for the 
new Spark-Druid connector we're planning.
Another project to look at is https://github.com/SharpRay/spark-druid-connector.
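
To make this a bit more concrete, here's a rough sketch (Scala, using the 
Spark DataFrame API) of what the "output sink" approach (option 2 in Itai's 
mail below) could look like from a user's point of view. Note that the "druid" 
format name and the option keys are purely illustrative - the actual API would 
be whatever the new connector ends up defining:

    import org.apache.spark.sql.SparkSession

    object DruidIngestSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("druid-ingest-sketch")
          .getOrCreate()

        // Pre-process the raw data in the same job...
        val events = spark.read
          .parquet("s3://my-bucket/raw-events/") // illustrative input path
          .filter("eventType IS NOT NULL")
          .select("timestamp", "eventType", "userId", "value")

        // ...and write Druid segments directly, instead of handing the output
        // to a separate ingestion task. The format name and options below are
        // hypothetical and only meant to show the shape of the API.
        events.write
          .format("druid")
          .option("dataSource", "events")
          .option("timestampColumn", "timestamp")
          .option("segmentGranularity", "DAY")
          .mode("overwrite")
          .save()

        spark.stop()
      }
    }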

On 2020/03/02 14:31:27, Oğuzhan Mangır <sosyalmedya.oguz...@gmail.com> wrote: 
> I think the second option would be better. Many people use Spark for batch 
> operations with isolated clusters. My friends and I will take some time for 
> that. Julian, can you share your experience with that? After that, we can 
> write down our aims, requirements and flows easily. 
> 
> On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote: 
> > Hey,
> > Per Gian's proposal, and following this thread in Druid user group (
> > https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> > thread in Druid Slack channel (
> > https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> > to start discussing the options of having Spark-based ingestion into Druid.
> > 
> > There's already an old project (https://github.com/metamx/druid-spark-batch)
> > for that, so perhaps we can use that as a starting point.
> > 
> > The thread on Slack suggested 2 approaches:
> > 
> >    1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
> >    Spark batch job that ingests data into Druid, as a simple replacement of
> >    the Hadoop MapReduce ingestion task.
> >    Meaning - your data pipeline will have a Spark job to pre-process the
> >    data (similar to what some of us have today), and another Spark job to
> >    read the output of the previous job, and create Druid segments (again -
> >    following the same pattern as the Hadoop MapReduce ingestion task).
> >    2. *Druid output sink for Spark* - rather than having 2 separate Spark
> >    jobs, 1 for pre-processing the data and 1 for ingesting the data into
> >    Druid, you'll have a single Spark job that pre-processes the data and
> >    creates Druid segments directly, e.g. sparkDataFrame.write.format("druid")
> >    (as suggested by omngr on Slack).
> > 
> > 
> > I personally prefer the 2nd approach - while it might be harder to
> > implement, it seems to offer greater benefits.
> > 
> > I'd like to hear your thoughts and to start getting this ball rolling.
> > 
> > Thanks,
> >            Itai
> > 
> 
