Hey,
Per Gian's proposal, and following this thread in the Druid user group (
https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
thread in the Druid Slack channel (
https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
to start discussing the options for Spark-based ingestion into Druid.

There's already an old project (https://github.com/metamx/druid-spark-batch)
for that, so perhaps we can use that as a starting point.

The thread on Slack suggested 2 approaches:

   1. *Simply replacing the Hadoop MapReduce ingestion task* - a Spark batch
   job that ingests data into Druid, as a drop-in replacement for the Hadoop
   MapReduce ingestion task.
   Meaning - your data pipeline would have one Spark job to pre-process the
   data (similar to what some of us have today), and another Spark job that
   reads the output of the previous job and creates Druid segments (again,
   following the same pattern as the Hadoop MapReduce ingestion task).
   2. *Druid output sink for Spark* - rather than having 2 separate Spark
   jobs, 1 for pre-processing the data and 1 for ingesting the data into
   Druid, you'd have a single Spark job that pre-processes the data and
   creates Druid segments directly, e.g. sparkDataFrame.write.format("druid")
   (as suggested by omngr on Slack; see the sketch after this list).
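
To make the 2nd approach more concrete, here is a rough sketch of what the
user-facing side of such a sink could look like. To be clear, this is only an
illustration for discussion: the "druid" data source name and all the option
keys (dataSource, segmentGranularity, deepStorage) are hypothetical and don't
exist today; only the standard Spark DataFrame API calls are real.

```scala
import org.apache.spark.sql.SparkSession

object DruidSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-sink-sketch")
      .getOrCreate()

    // Pre-processing step: whatever transformations the pipeline already does.
    val events = spark.read.parquet("s3://example-bucket/raw-events/")
      .selectExpr("timestamp", "page", "country", "count")

    // Approach 2: the same job writes Druid segments directly,
    // instead of handing the output off to a separate ingestion task.
    events.write
      .format("druid")                                   // hypothetical data source name
      .option("dataSource", "wikipedia_edits")           // assumed option: target Druid datasource
      .option("segmentGranularity", "DAY")               // assumed option
      .option("deepStorage", "s3://example-bucket/druid/segments/") // assumed option
      .mode("overwrite")
      .save()
  }
}
```

The nice property of this shape is that it follows the standard Spark data
source pattern, so it would plug into existing pipelines with one extra
`.write` call rather than a second job and a hand-off through deep storage.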


I personally prefer the 2nd approach - while it might be harder to
implement, its benefits seem greater.

I'd like to hear your thoughts and to start getting this ball rolling.

Thanks,
           Itai
