We should get organized around this. It's a significant problem for all batch 
operations. My colleagues and I will set aside time for it. Julian, can you 
share your experience with this? After that, we can write up our aims, 
requirements and flows.

On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote: 
> Hey,
> Per Gian's proposal, and following this thread in Druid user group (
> https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> thread in Druid Slack channel (
> https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> to start discussing the options of having Spark-based ingestion into Druid.
> 
> There's already an old project (https://github.com/metamx/druid-spark-batch)
> for that, so perhaps we can use that as a starting point.
> 
> The thread on Slack suggested 2 approaches:
> 
>    1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
>    Spark batch job that ingests data into Druid, as a simple replacement of
>    the Hadoop MapReduce ingestion task.
>    Meaning - your data pipeline will have a Spark job to pre-process the
>    data (similar to what some of us have today), and another Spark job to read
>    the output of the previous job, and create Druid segments (again -
>    following the same pattern as the Hadoop MapReduce ingestion task).
>    2. *Druid output sink for Spark* - rather than having 2 separate Spark
>    jobs, 1 for pre-processing the data and 1 for ingesting the data into
>    Druid, you'll have a single Spark job that pre-processes the data and
>    creates Druid segments directly, e.g. sparkDataFrame.write.format("druid")
>    (as suggested by omngr on Slack; see the sketch after this list).
> 
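> For illustration, here's a minimal sketch of what the 2nd approach could
> look like from the caller's side. The "druid" format name, the option keys
> and the surrounding object are assumptions for the sake of the example,
> not an existing API:
> 
>     // Hypothetical Druid output sink for Spark (Scala), assuming a custom
>     // DataSource implementation that registers under the name "druid".
>     import org.apache.spark.sql.SparkSession
> 
>     object DruidIngestExample {
>       def main(args: Array[String]): Unit = {
>         val spark = SparkSession.builder()
>           .appName("druid-ingest-example")
>           .getOrCreate()
> 
>         // Pre-process the raw data in the same job...
>         val events = spark.read
>           .parquet("s3://bucket/raw-events/") // illustrative input path
>           .filter("eventType IS NOT NULL")
> 
>         // ...then create Druid segments directly, with no intermediate
>         // storage or second job in between.
>         events.write
>           .format("druid")                     // hypothetical format name
>           .option("dataSource", "events")      // target Druid datasource
>           .option("timestampColumn", "ts")     // illustrative option keys
>           .option("segmentGranularity", "DAY")
>           .mode("overwrite")
>           .save()
>       }
>     }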
> 
> I personally prefer the 2nd approach - while it might be harder to
> implement, it seems like the better option in the long run.
