I'd guess the majority of users are just using Druid itself to process Druid data, although there are a few people out there who export it into other systems using techniques like the above.
On Wed, Feb 13, 2019 at 2:00 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:

> Am curious to know how people are generally processing data from druid? We
> want to be able to do Spark processing in a distributed fashion using
> DataFrames.
>
> - Rajiv
>
> On 2/11/19, 1:04 PM, "Julian Jaffe" <jja...@pinterest.com.INVALID> wrote:
>
> > Spark can convert an RDD of JSON strings into an RDD/DataFrame/DataSet of
> > objects parsed from the JSON (something like
> > `sparkSession.read.json(jsonStringRDD)`). You could hook this up to a
> > Druid response, but I would definitely recommend looking through the code
> > that Gian posted instead - it reads data from deep storage instead of
> > sending an HTTP request to the Druid cluster and waiting for the response.
> >
> > On Sat, Feb 9, 2019 at 5:02 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
> >
> > > Thanks Julian,
> > > See some questions in-line:
> > >
> > > On 2/6/19, 3:01 PM, "Julian Jaffe" <jja...@pinterest.com.INVALID> wrote:
> > >
> > > > I think this question is going the other way (e.g. how to read data
> > > > into Spark, as opposed to into Druid). For that, the quickest and
> > > > dirtiest approach is probably to use Spark's JSON support to parse a
> > > > Druid response.
> > >
> > > [Rajiv] Can you please expand more here?
> > >
> > > > You may also be able to repurpose some code from
> > > > https://github.com/SparklineData/spark-druid-olap, but I don't think
> > > > there's any official guidance on this.
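To make Julian's "quick and dirty" suggestion concrete, here is a minimal sketch of turning a Druid query response into per-row JSON strings, the shape that `sparkSession.read.json` accepts. It assumes a Druid "scan" query with the default `"list"` result format; the embedded response body, column names, and broker endpoint mentioned in the comments are illustrative, not from the thread.

```python
import json

# Illustrative Druid "scan" query response (resultFormat="list"). A real
# body would come from POSTing a scan query to the broker, e.g.
# http://<broker>:8082/druid/v2/ (endpoint is an assumption; check the docs).
druid_response = json.dumps([
    {
        "segmentId": "wikipedia_2019-02-01T00:00:00.000Z",
        "columns": ["__time", "page", "added"],
        "events": [
            {"__time": "2019-02-01T00:00:00Z", "page": "Main", "added": 17},
            {"__time": "2019-02-01T00:05:00Z", "page": "Druid", "added": 42},
        ],
    }
])

def response_to_json_rows(body):
    """Flatten a scan-query response into one JSON string per event row,
    suitable for parallelizing and handing to sparkSession.read.json."""
    rows = []
    for segment in json.loads(body):
        for event in segment.get("events", []):
            rows.append(json.dumps(event))
    return rows

rows = response_to_json_rows(druid_response)

# With a live SparkSession you would then do something like:
#   df = spark.read.json(spark.sparkContext.parallelize(rows))
```

Note that this pulls the whole response through the broker, which is exactly the bottleneck Julian warns about; reading segments from deep storage avoids it.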
> > > > On Wed, Feb 6, 2019 at 2:21 PM Gian Merlino <g...@apache.org> wrote:
> > > >
> > > > > Hey Rajiv,
> > > > >
> > > > > There's an unofficial Druid/Spark adapter at:
> > > > > https://github.com/metamx/druid-spark-batch. If you want to stick
> > > > > to official things, then the best approach would be to use Spark
> > > > > to write data to HDFS or S3 and then ingest it into Druid using
> > > > > Druid's Hadoop-based or native batch ingestion. (Or even write it
> > > > > to Kafka using Spark Streaming and ingest from Kafka into Druid
> > > > > using Druid's Kafka indexing service.)
> > > > >
> > > > > On Wed, Feb 6, 2019 at 12:04 PM Rajiv Mordani <rmord...@vmware.com.invalid> wrote:
> > > > >
> > > > > > Is there a best practice for how to load data from druid to use
> > > > > > in a Spark batch job? I asked this question on the user alias
> > > > > > but got no response, hence reposting here.
> > > > > >
> > > > > > - Rajiv
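For the "official" path Gian describes (Spark writes files to HDFS/S3, Druid batch-ingests them), the Druid side is a native batch ingestion spec. The sketch below builds a minimal `index_parallel` spec as JSON; the field layout follows Druid's native batch ingestion, but the datasource name, S3 prefix, and column names are made-up placeholders, and the exact fields should be checked against the Druid docs for your version.

```python
import json

# Minimal native batch ("index_parallel") ingestion spec. All names and
# paths below are hypothetical examples, not values from the thread.
ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events_from_spark",  # assumed datasource name
            "timestampSpec": {"column": "__time", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["page"]},
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "s3",
                "prefixes": ["s3://my-bucket/spark-output/"],  # assumed path
            },
            # Matches Spark output written as JSON lines; use a different
            # inputFormat if Spark writes Parquet or CSV.
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

# The spec would be submitted to the Overlord's task endpoint, e.g.
# http://<overlord>:8081/druid/indexer/v1/task (endpoint is an assumption).
spec_json = json.dumps(ingestion_spec, indent=2)
```

Because the handoff is just files plus a spec, Spark and Druid stay fully decoupled, which is why the thread calls this the most officially supported approach.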