I'd guess the majority of users are just using Druid itself to process
Druid data, although there are a few people out there that export it into
other systems using techniques like the above.
On Wed, Feb 13, 2019 at 2:00 PM Rajiv Mordani wrote:
Am curious to know how people are generally processing data from Druid? We want
to be able to do Spark processing in a distributed fashion using DataFrames.
- Rajiv
On 2/11/19, 1:04 PM, "Julian Jaffe" wrote:
Spark can convert an RDD of JSON strings into an RDD/DataFrame/Dataset of
objects parsed from the JSON (something like
`sparkSession.read.json(jsonStringRDD)`). You could hook this up to a Druid
response, but I would definitely recommend looking through the code that
Gian posted instead - it reads the Druid data directly instead of going
through a query response.
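A minimal sketch of that quick-and-dirty approach, assuming a broker at
localhost:8082 and a hypothetical "wikipedia" datasource (the query, columns,
and interval are all placeholders): it POSTs a native scan query to the broker
and lets spark.read.json infer a schema from the response.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.SparkSession
import scala.io.Source

object DruidJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-json-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder broker endpoint and native scan query.
    val brokerUrl = "http://localhost:8082/druid/v2/"
    val query =
      """{
        |  "queryType": "scan",
        |  "dataSource": "wikipedia",
        |  "intervals": ["2019-01-01/2019-02-01"],
        |  "resultFormat": "list",
        |  "columns": ["__time", "channel", "added"]
        |}""".stripMargin

    // POST the query to the broker and read back the JSON response.
    val conn = new URL(brokerUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(query.getBytes(StandardCharsets.UTF_8))
    val response = Source.fromInputStream(conn.getInputStream, "UTF-8").mkString

    // Parse the response JSON into a DataFrame. Scan results nest rows under
    // an "events" field, so a real job would still flatten them. Note also
    // that the whole response passes through the driver, so this only suits
    // small result sets.
    val df = spark.read.json(Seq(response).toDS)
    df.printSchema()
    spark.stop()
  }
}

For larger extracts, the InputFormat route discussed below reads data in
parallel across executors rather than through a single driver.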
Thanks Julian,
See some questions in-line:
On 2/6/19, 3:01 PM, "Julian Jaffe" wrote:
Ah, you're right. I misread the original question.
In that case, also try checking out:
https://github.com/implydata/druid-hadoop-inputformat, an unofficial Druid
InputFormat. Spark can use that to read Druid data into an RDD - check the
example in the README. It's also unofficial and, currently, unmaintained.
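The general Spark pattern for consuming any Hadoop InputFormat is sketched
below. TextInputFormat stands in so the snippet compiles; the Druid
InputFormat's actual class name and configuration keys live in that repo's
README and are not reproduced here, so treat the config lines as placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SparkSession

object InputFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inputformat-sketch")
      .master("local[*]")
      .getOrCreate()

    // For druid-hadoop-inputformat you would set its datasource/interval
    // keys here and swap its InputFormat class in below (see the README).
    val conf = new Configuration()
    conf.set("mapreduce.input.fileinputformat.inputdir",
      "hdfs:///tmp/example") // placeholder input path

    // newAPIHadoopRDD is Spark's standard entry point for Hadoop InputFormats.
    val rdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    )
    rdd.map { case (_, line) => line.toString }.take(5).foreach(println)
    spark.stop()
  }
}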
I think this question is going the other way (e.g. how to read data into
Spark, as opposed to into Druid). For that, the quickest and dirtiest
approach is probably to use Spark's json support to parse a Druid response.
You may also be able to repurpose some code from
https://github.com/SparklineData/spark-druid-olap.
Hey Rajiv,
There's an unofficial Druid/Spark adapter at:
https://github.com/metamx/druid-spark-batch. If you want to stick to
official things, then the best approach would be to use Spark to write data
to HDFS or S3 and then ingest it into Druid using Druid's Hadoop-based or
native batch ingestion.
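A sketch of that official route, with placeholder paths and a made-up source
table: Spark writes newline-delimited JSON to a staging directory, and a Druid
native-batch or Hadoop ingestion spec then points its input at that directory.

import org.apache.spark.sql.SparkSession

object ExportForDruid {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("export-for-druid")
      .getOrCreate()

    // Placeholder source data; any DataFrame works here.
    val df = spark.read.parquet("hdfs:///data/events")

    // Druid batch ingestion can read newline-delimited JSON, so plain .json
    // output is the simplest handoff format.
    df.write.mode("overwrite").json("hdfs:///tmp/druid-staging/events")

    // Next step (outside Spark): submit an ingestion spec whose input source
    // points at hdfs:///tmp/druid-staging/events.
    spark.stop()
  }
}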
Is there a best practice for how to load data from Druid to use in a Spark
batch job? I asked this question on the user alias but got no response, hence
reposting here.
- Rajiv