Whichever approach we take, I think we'll need to expose an OnHeapMemorySegmentWriteOutMediumFactory for OnHeapMemorySegmentWriteOutMedium, paralleling the existing OffHeapMemorySegmentWriteOutMediumFactory. Off-heap index building will be faster, but it's very difficult to get most schedulers to account for off-heap allocations correctly in Spark containers. I can likely get a diff up in the next day or two.
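Roughly, I'm picturing something like the sketch below (untested; it assumes the SegmentWriteOutMediumFactory interface keeps its current shape, and just hands out the on-heap medium that already exists in the codebase):

  package org.apache.druid.segment.writeout;

  import java.io.File;

  // Parallels OffHeapMemorySegmentWriteOutMediumFactory, but hands out the
  // existing on-heap medium so all segment-building buffers live in the JVM
  // heap that Spark schedulers already account for.
  public final class OnHeapMemorySegmentWriteOutMediumFactory implements SegmentWriteOutMediumFactory
  {
    private static final OnHeapMemorySegmentWriteOutMediumFactory INSTANCE =
        new OnHeapMemorySegmentWriteOutMediumFactory();

    public static OnHeapMemorySegmentWriteOutMediumFactory instance()
    {
      return INSTANCE;
    }

    private OnHeapMemorySegmentWriteOutMediumFactory() {}

    @Override
    public SegmentWriteOutMedium makeSegmentWriteOutMedium(File outDir)
    {
      // Buffers are allocated on the JVM heap, so the output directory is unused.
      return new OnHeapMemorySegmentWriteOutMedium();
    }
  }
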
On Wed, Feb 26, 2020 at 5:26 AM itai yaffe <itai.ya...@gmail.com> wrote:

> Hey,
>
> Per Gian's proposal, and following this thread in the Druid user group (
> https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> thread in the Druid Slack channel (
> https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> to start discussing the options for Spark-based ingestion into Druid.
>
> There's already an old project (https://github.com/metamx/druid-spark-batch)
> for that, so perhaps we can use it as a starting point.
>
> The thread on Slack suggested 2 approaches:
>
> 1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
> Spark batch job that ingests data into Druid, as a simple replacement
> of the Hadoop MapReduce ingestion task. Meaning - your data pipeline
> will have a Spark job to pre-process the data (similar to what some of
> us have today), and another Spark job to read the output of the
> previous job and create Druid segments (again - following the same
> pattern as the Hadoop MapReduce ingestion task).
> 2. *Druid output sink for Spark* - rather than having 2 separate Spark
> jobs, 1 for pre-processing the data and 1 for ingesting the data into
> Druid, you'd have a single Spark job that pre-processes the data and
> creates Druid segments directly, e.g.
> sparkDataFrame.write.format("druid") (as suggested by omngr on Slack).
>
> I personally prefer the 2nd approach - while it might be harder to
> implement, its benefits seem greater.
>
> I'd like to hear your thoughts and to start getting this ball rolling.
>
> Thanks,
> Itai
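
Re: option 2 above - from the user's side that could look roughly like the sketch below. To be clear, the "druid" format name and the option keys are placeholders I'm making up for illustration, not an existing API:

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SaveMode;
  import org.apache.spark.sql.SparkSession;

  public class DruidSinkExample
  {
    public static void main(String[] args)
    {
      SparkSession spark = SparkSession.builder()
          .appName("druid-sink-example")
          .getOrCreate();

      // Pre-process the raw input in the same job...
      Dataset<Row> events = spark.read().parquet("s3://bucket/raw-events/");

      // ...then write Druid segments directly, instead of handing the output
      // to a second ingestion job.
      events.write()
          .format("druid")                  // placeholder sink name
          .mode(SaveMode.Overwrite)
          .option("dataSource", "events")   // placeholder option keys
          .option("timestampColumn", "ts")
          .save();

      spark.stop();
    }
  }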