I am curious to know how people generally process data from Druid. We want to be able to do Spark processing in a distributed fashion using DataFrames.
- Rajiv

On 2/11/19, 1:04 PM, "Julian Jaffe" <[email protected]> wrote:

Spark can convert an RDD of JSON strings into an RDD/DataFrame/DataSet of objects parsed from the JSON (something like `sparkSession.read.json(jsonStringRDD)`). You could hook this up to a Druid response, but I would definitely recommend looking through the code that Gian posted instead - it reads data from deep storage instead of sending an HTTP request to the Druid cluster and waiting for the response.

On Sat, Feb 9, 2019 at 5:02 PM Rajiv Mordani <[email protected]> wrote:

> Thanks Julian,
> See some questions in-line:
>
> On 2/6/19, 3:01 PM, "Julian Jaffe" <[email protected]> wrote:
>
> I think this question is going the other way (e.g. how to read data into
> Spark, as opposed to into Druid). For that, the quickest and dirtiest
> approach is probably to use Spark's json support to parse a Druid response.
>
> [Rajiv] Can you please expand more here?
>
> You may also be able to repurpose some code from
> https://github.com/SparklineData/spark-druid-olap, but I don't think
> there's any official guidance on this.
>
> On Wed, Feb 6, 2019 at 2:21 PM Gian Merlino <[email protected]> wrote:
>
> > Hey Rajiv,
> >
> > There's an unofficial Druid/Spark adapter at:
> > https://github.com/metamx/druid-spark-batch. If you want to stick to
> > official things, then the best approach would be to use Spark to write data
> > to HDFS or S3 and then ingest it into Druid using Druid's Hadoop-based or
> > native batch ingestion. (Or even write it to Kafka using Spark Streaming
> > and ingest from Kafka into Druid using Druid's Kafka indexing service.)
> >
> > On Wed, Feb 6, 2019 at 12:04 PM Rajiv Mordani <[email protected]> wrote:
> >
> > > Is there a best practice for how to load data from druid to use in a
> > > spark batch job? I asked this question on the user alias but got no
> > > response hence reposting here.
> > >
> > > * Rajiv
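
For reference, the "parse a Druid response with Spark's JSON support" approach mentioned in this thread could look roughly like the PySpark sketch below. It is only an illustration under stated assumptions: the response is assumed to have the shape of a Druid timeseries result (a JSON array of objects with "timestamp" and "result" fields), both function names are made up for this example, and `spark` is an existing SparkSession. The whole response is held on the driver before being parallelized, so this only suits small results; for anything larger, reading from deep storage as suggested above is the better route.

```python
import json


def druid_timeseries_to_json_lines(response_text):
    """Flatten a Druid timeseries response into one JSON string per row.

    Assumes the response is a JSON array of objects shaped like
    {"timestamp": "...", "result": {...}} (hypothetical helper).
    """
    rows = json.loads(response_text)
    lines = []
    for row in rows:
        # Merge the row timestamp with the aggregated values in "result".
        record = {"timestamp": row["timestamp"]}
        record.update(row["result"])
        lines.append(json.dumps(record, sort_keys=True))
    return lines


def json_lines_to_dataframe(spark, json_lines):
    """Parse JSON strings into a DataFrame; `spark` is an existing SparkSession."""
    # Distribute the strings, then let Spark infer the schema from the JSON.
    rdd = spark.sparkContext.parallelize(json_lines)
    return spark.read.json(rdd)
```

With the HTTP response body in a string `body`, the call would be `json_lines_to_dataframe(spark, druid_timeseries_to_json_lines(body))`.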
