On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote: 
> Hey,
> Per Gian's proposal, and following this thread in the Druid user group (
> https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> thread in the Druid Slack channel (
> https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> to start discussing the options for Spark-based ingestion into Druid.
> 
> There's already an old project (https://github.com/metamx/druid-spark-batch)
> for this, so perhaps we can use it as a starting point.
> 
> The thread on Slack suggested 2 approaches:
> 
>    1. *Simply replacing the Hadoop MapReduce ingestion task* - a Spark
>    batch job that ingests data into Druid, as a drop-in replacement for
>    the Hadoop MapReduce ingestion task.
>    Meaning: your data pipeline would have one Spark job to pre-process the
>    data (similar to what some of us have today), and another Spark job to
>    read the output of the previous job and create Druid segments
>    (following the same pattern as the Hadoop MapReduce ingestion task).
>    2. *Druid output sink for Spark* - rather than having 2 separate Spark
>    jobs (1 to pre-process the data and 1 to ingest it into Druid), you'd
>    have a single Spark job that pre-processes the data and creates Druid
>    segments directly, e.g. sparkDataFrame.write.format("druid") (as
>    suggested by omngr on Slack; see the sketch below).
> 
> 
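> To make the 2nd approach a bit more concrete, here's a rough sketch of
> what the single-job flow could look like. To be clear, the "druid"
> format and its option keys are hypothetical - this is the API we'd be
> building, not something that exists today:
> 
>    import org.apache.spark.sql.{SaveMode, SparkSession}
> 
>    object DruidSinkSketch {
>      def main(args: Array[String]): Unit = {
>        val spark = SparkSession.builder()
>          .appName("druid-ingestion-sketch")
>          .getOrCreate()
> 
>        // Pre-process the raw data in the same job...
>        val events = spark.read.parquet("s3://bucket/raw-events/")
>          .filter("country IS NOT NULL")
>          .withColumnRenamed("ts", "__time")
> 
>        // ...then write Druid segments directly, instead of handing the
>        // output to a separate ingestion task (hypothetical sink/options):
>        events.write
>          .format("druid")
>          .option("table", "events")
>          .option("segmentGranularity", "DAY")
>          .mode(SaveMode.Append)
>          .save()
>      }
>    }
> 
> With the 1st approach, the same pipeline would be split into 2 jobs: the
> pre-processing job would write its output to deep storage, and a second
> job (replacing the Hadoop MapReduce ingestion task) would read that
> output and create the segments.
> 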
> I personally prefer the 2nd approach - while it might be harder to
> implement, its benefits seem greater.
> 
> I'd like to hear your thoughts and to get this ball rolling.
> 
> Thanks,
>            Itai
> 