I think for whatever approach we take, we'll need to expose an
OnHeapMemorySegmentWriteOutMediumFactory for OnHeapMemorySegmentWriteOutMedium
that parallels OffHeapMemorySegmentWriteOutMediumFactory. Although off-heap
index building will be faster, it's very difficult to get most schedulers
to allocate off-heap resources correctly for Spark containers. I can likely
get a diff up in the next day or two.
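
Something along these lines is what I have in mind - just a sketch (in Scala
for brevity; the actual factory would be a small Java addition to Druid). I'm
assuming SegmentWriteOutMediumFactory's only method is
makeSegmentWriteOutMedium(File), mirroring the off-heap factory, and that
OnHeapMemorySegmentWriteOutMedium has a no-arg constructor:

  import java.io.File

  import org.apache.druid.segment.writeout.{
    OnHeapMemorySegmentWriteOutMedium,
    SegmentWriteOutMedium,
    SegmentWriteOutMediumFactory
  }

  // Singleton factory paralleling OffHeapMemorySegmentWriteOutMediumFactory, but handing
  // out on-heap media so the intermediate buffers are counted against the JVM heap that
  // the Spark scheduler already sizes for the container.
  object OnHeapMemorySegmentWriteOutMediumFactory extends SegmentWriteOutMediumFactory {

    override def makeSegmentWriteOutMedium(outDir: File): SegmentWriteOutMedium = {
      // The on-heap medium keeps everything in heap buffers and never touches outDir;
      // the parameter is accepted only to satisfy the (assumed) interface signature.
      new OnHeapMemorySegmentWriteOutMedium()
    }
  }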

On Wed, Feb 26, 2020 at 5:26 AM itai yaffe <itai.ya...@gmail.com> wrote:

> Hey,
> Per Gian's proposal, and following this thread in Druid user group (
> https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> thread in Druid Slack channel (
> https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> to start discussing the options of having Spark-based ingestion into Druid.
>
> There's already an old project (
> https://github.com/metamx/druid-spark-batch)
> for that, so perhaps we can use that as a starting point.
>
> The thread on Slack suggested 2 approaches:
>
>    1. *Simply replacing the Hadoop MapReduce ingestion task* - a Spark
>    batch job that ingests data into Druid as a drop-in replacement for
>    the Hadoop MapReduce ingestion task.
>    Meaning - your data pipeline will have a Spark job to pre-process the
>    data (similar to what some of us have today), and another Spark job
>    that reads the output of the previous job and creates Druid segments
>    (again - following the same pattern as the Hadoop MapReduce ingestion
>    task).
>    2. *Druid output sink for Spark* - rather than having 2 separate Spark
>    jobs, 1 for pre-processing the data and 1 for ingesting the data into
>    Druid, you'll have a single Spark job that pre-processes the data and
>    creates Druid segments directly, e.g.
>    sparkDataFrame.write.format("druid") (as suggested by omngr on Slack) -
>    see the sketch below this list.
>
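> Just to make the 2nd approach a bit more concrete, the write side could look
> roughly like the sketch below - the "druid" format name, the options and the
> paths are placeholders; the real connector and its options are exactly what
> we'd need to design:
>
>   import org.apache.spark.sql.{SaveMode, SparkSession}
>
>   object DruidSinkExample {
>     def main(args: Array[String]): Unit = {
>       val spark = SparkSession.builder().appName("druid-sink-example").getOrCreate()
>
>       // Pre-process the raw data in the same job...
>       val cleaned = spark.read
>         .parquet("s3://bucket/raw-events/")
>         .filter("eventTime IS NOT NULL")
>
>       // ...and hand the result straight to a (yet to be written) Druid sink,
>       // instead of writing intermediate output for a separate ingestion job.
>       cleaned.write
>         .format("druid")
>         .mode(SaveMode.Overwrite)
>         .option("dataSource", "events")
>         .option("segmentGranularity", "DAY")
>         .save()
>
>       spark.stop()
>     }
>   }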
>
> I personally prefer the 2nd approach - while it might be harder to
> implement, it seems to offer greater benefits.
>
> I'd like to hear your thoughts and to start getting this ball rolling.
>
> Thanks,
>            Itai
>
