Hey,
Per Gian's proposal, and following this thread in the Druid user group ( https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this thread in the Druid Slack channel ( https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like to start discussing options for Spark-based ingestion into Druid.
There's already an old project for this (https://github.com/metamx/druid-spark-batch), so perhaps we can use it as a starting point. The thread on Slack suggested 2 approaches:
1. *Simply replacing the Hadoop MapReduce ingestion task* - a Spark batch job that ingests data into Druid, as a drop-in replacement for the Hadoop MapReduce ingestion task. Meaning - your data pipeline would have one Spark job to pre-process the data (similar to what some of us have today), and another Spark job that reads the output of the first job and creates Druid segments (again - following the same pattern as the Hadoop MapReduce ingestion task).
2. *Druid output sink for Spark* - rather than having 2 separate Spark jobs, 1 for pre-processing the data and 1 for ingesting the data into Druid, you'd have a single Spark job that pre-processes the data and creates Druid segments directly, e.g. sparkDataFrame.write.format("druid") (as suggested by omngr on Slack; see the rough sketch at the end of this message).
I personally prefer the 2nd approach - while it might be harder to implement, its benefits seem greater. I'd like to hear your thoughts and to start getting the ball rolling.
Thanks,
Itai
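
To make the 2nd approach a bit more concrete, here's a minimal sketch of what the user-facing side could look like. This is purely illustrative - the "druid" format and every option key below (dataSource, timestampColumn, segmentGranularity, deepStorage.type, metadata.uri) are placeholders I made up, not an existing API:

import org.apache.spark.sql.SparkSession

object DruidIngestJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-spark-ingest")
      .getOrCreate()

    // Pre-process the raw data in the same job that will write the segments.
    val events = spark.read.parquet("s3://example-bucket/raw-events/")
      .filter("country IS NOT NULL")
      .selectExpr("__time", "country", "page", "CAST(added AS LONG) AS added")

    // Hypothetical Druid sink - the format name and option keys are illustrative only.
    events.write
      .format("druid")
      .option("dataSource", "wikipedia")
      .option("timestampColumn", "__time")
      .option("segmentGranularity", "DAY")
      .option("deepStorage.type", "s3")
      .option("metadata.uri", "jdbc:postgresql://meta-host/druid")
      .mode("overwrite")
      .save()

    spark.stop()
  }
}

Presumably this would be built on Spark's DataSource writer interface, so segment creation and metadata publishing happen inside the same job that did the pre-processing, instead of handing off to a second job.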