I'd guess the majority of users are just using Druid itself to process Druid data, although there are a few people out there who export it into other systems using techniques like the above.
On Wed, Feb 13, 2019 at 2:00 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:

> Am curious to know how people are generally processing data from druid? We
> want to be able to do Spark processing in a distributed fashion using
> DataFrames.
>
> - Rajiv
>
> On 2/11/19, 1:04 PM, "Julian Jaffe" <jja...@pinterest.com.INVALID> wrote:
>
> > Spark can convert an RDD of JSON strings into an RDD/DataFrame/DataSet of
> > objects parsed from the JSON (something like
> > `sparkSession.read.json(jsonStringRDD)`). You could hook this up to a
> > Druid response, but I would definitely recommend looking through the code
> > that Gian posted instead - it reads data from deep storage instead of
> > sending an HTTP request to the Druid cluster and waiting for the response.
> >
> > On Sat, Feb 9, 2019 at 5:02 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
> >
> > > Thanks Julian,
> > > See some questions in-line:
> > >
> > > On 2/6/19, 3:01 PM, "Julian Jaffe" <jja...@pinterest.com.INVALID> wrote:
> > >
> > > > I think this question is going the other way (e.g. how to read data
> > > > into Spark, as opposed to into Druid). For that, the quickest and
> > > > dirtiest approach is probably to use Spark's JSON support to parse a
> > > > Druid response.
> > >
> > > [Rajiv] Can you please expand more here?
> > >
> > > > You may also be able to repurpose some code from
> > > > https://github.com/SparklineData/spark-druid-olap, but I don't think
> > > > there's any official guidance on this.
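To make Julian's "quick and dirty" suggestion concrete, here is a minimal sketch of turning a Druid query response into per-row JSON strings, the shape that `sparkSession.read.json` accepts. It assumes a Druid "scan" query with the default `"list"` result format; the embedded response body, column names, and broker endpoint mentioned in the comments are illustrative, not from the thread.

```python
import json

# Illustrative Druid "scan" query response (resultFormat="list"). A real
# body would come from POSTing a scan query to the broker, e.g.
# http://<broker>:8082/druid/v2/ (endpoint is an assumption; check the docs).
druid_response = json.dumps([
    {
        "segmentId": "wikipedia_2019-02-01T00:00:00.000Z",
        "columns": ["__time", "page", "added"],
        "events": [
            {"__time": "2019-02-01T00:00:00Z", "page": "Main", "added": 17},
            {"__time": "2019-02-01T00:05:00Z", "page": "Druid", "added": 42},
        ],
    }
])

def response_to_json_rows(body):
    """Flatten a scan-query response into one JSON string per event row,
    suitable for parallelizing and handing to sparkSession.read.json."""
    rows = []
    for segment in json.loads(body):
        for event in segment.get("events", []):
            rows.append(json.dumps(event))
    return rows

rows = response_to_json_rows(druid_response)

# With a live SparkSession you would then do something like:
#   df = spark.read.json(spark.sparkContext.parallelize(rows))
```

Note that this pulls the whole response through the broker, which is exactly the bottleneck Julian warns about; reading segments from deep storage avoids it.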
> > > > On Wed, Feb 6, 2019 at 2:21 PM Gian Merlino <g...@apache.org> wrote:
> > > >
> > > > > Hey Rajiv,
> > > > >
> > > > > There's an unofficial Druid/Spark adapter at:
> > > > > https://github.com/metamx/druid-spark-batch. If you want to stick
> > > > > to official things, then the best approach would be to use Spark
> > > > > to write data to HDFS or S3 and then ingest it into Druid using
> > > > > Druid's Hadoop-based or native batch ingestion. (Or even write it
> > > > > to Kafka using Spark Streaming and ingest from Kafka into Druid
> > > > > using Druid's Kafka indexing service.)
> > > > >
> > > > > On Wed, Feb 6, 2019 at 12:04 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
> > > > >
> > > > > > Is there a best practice for how to load data from druid to use
> > > > > > in a Spark batch job? I asked this question on the user alias
> > > > > > but got no response, hence reposting here.
> > > > > >
> > > > > > - Rajiv
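For the "official" path Gian describes (Spark writes files to HDFS/S3, Druid batch-ingests them), the Druid side is a native batch ingestion spec. The sketch below builds a minimal `index_parallel` spec as JSON; the field layout follows Druid's native batch ingestion, but the datasource name, S3 prefix, and column names are made-up placeholders, and the exact fields should be checked against the Druid docs for your version.

```python
import json

# Minimal native batch ("index_parallel") ingestion spec. All names and
# paths below are hypothetical examples, not values from the thread.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events_from_spark",  # assumed datasource name
            "timestampSpec": {"column": "__time", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["page"]},
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "s3",
                "prefixes": ["s3://my-bucket/spark-output/"],  # assumed path
            },
            # Matches Spark output written as JSON lines; use a different
            # inputFormat if Spark writes Parquet or CSV.
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

# The spec would be submitted to the Overlord's task endpoint, e.g.
# http://<overlord>:8081/druid/indexer/v1/task (endpoint is an assumption).
spec_json = json.dumps(ingestion_spec, indent=2)
```

Because the handoff is just files plus a spec, Spark and Druid stay fully decoupled, which is why the thread calls this the most officially supported approach.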