Re: Druid and machine learning

Roman Leventov Tue, 28 Jan 2020 00:40:25 -0800

However, I now see Charles' point -- the data which is typically stored
in Druid rows is simple and is not something models are typically applied
to. Timeseries themselves (that is, the results of timeseries queries in
Druid) may be an input for anomaly detection or phase transition models,
but there is no point in applying them inside Druid.
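
To illustrate what applying such a model outside Druid could look like, here is a minimal sketch in Python. It assumes the rows have the shape of a native timeseries query result (a `timestamp` plus a `result` map); the `count` metric, the sample values, and the `flag_anomalies` helper are all hypothetical, and the trailing-window z-score is only a stand-in for a real anomaly detection model.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` values by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical rows, shaped like the output of a timeseries query.
# The "count" metric name and the numbers are made up for illustration.
rows = [
    {"timestamp": "2020-01-28T00:00:00Z", "result": {"count": 100}},
    {"timestamp": "2020-01-28T01:00:00Z", "result": {"count": 102}},
    {"timestamp": "2020-01-28T02:00:00Z", "result": {"count": 98}},
    {"timestamp": "2020-01-28T03:00:00Z", "result": {"count": 101}},
    {"timestamp": "2020-01-28T04:00:00Z", "result": {"count": 99}},
    {"timestamp": "2020-01-28T05:00:00Z", "result": {"count": 100}},
    {"timestamp": "2020-01-28T06:00:00Z", "result": {"count": 500}},
    {"timestamp": "2020-01-28T07:00:00Z", "result": {"count": 101}},
]

counts = [row["result"]["count"] for row in rows]
for i in flag_anomalies(counts):
    print(rows[i]["timestamp"], counts[i])
```

The query result is already small by the time it reaches this step, so the model runs over a handful of aggregated points rather than over raw rows.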


One corner case is sketches which are time series, so models could be
applied to them individually.

On Tue, 28 Jan 2020 at 08:59, Roman Leventov <[email protected]> wrote:

> I was thinking about model training at Druid indexing side and evaluation
> at Druid querying side.
>
> The advantage Druid has over Spark at querying is faster row filtering
> thanks to bitset indexes. But since model evaluation is a pretty heavy
> operation (I suppose; does anyone have ballpark time estimates? how does it
> compare to a Sketch update?), row scanning may not be the bottleneck, and
> therefore there is no significant reason to use Druid instead of just
> plugging the Spark engine into Druid segments.
>
> On the indexing side, the Druid indexer may be considered a general-purpose
> job scheduler, so that somebody who already has Druid may leverage it
> instead of setting up a separate Airflow scheduler.
>
> On Tue, 28 Jan 2020, 06:46 Charles Allen, <[email protected]> wrote:
>
>> > it makes more sense to have tooling around Druid, to slice and dice
>> > the data that you need, and do the ML stuff in sklearn, or even in Spark
>>
>> I agree with this sentiment. Druid as an execution engine is very good at
>> doing distributed aggregation (distributed reduce). What advantage does
>> Druid as an engine have that Spark does not for ML?
>>
>> Are you talking about training or model evaluation? Or either?
>>
>> It *might* be possible to have a likeness mechanism, whereby you can pass
>> in a model as a filter and aggregate on rows (dimension tuples?) that match
>> the model by some minimum criteria, but I'm not really sure what utility
>> that would be. Maybe as a quick backtesting engine? I feel like I'm a
>> solution searching for a problem going down this route though.
>>
>>
>> On Mon, Jan 27, 2020 at 12:11 AM Driesprong, Fokko <[email protected]>
>> wrote:
>>
>> > > Vertica has it. Good idea to introduce it in Druid.
>> >
>> > I'm not sure if this is a valid argument. With this argument, you can
>> > introduce anything into Druid. I think it is good to be opinionated, and
>> > to be clear as a community about why we do or don't introduce ML
>> > possibilities into the software.
>> >
>> > For example, databases like Postgres and BigQuery allow users to build
>> > simple regression models:
>> > https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro. I also don't
>> > think it is that hard to introduce linear regression using gradient
>> > descent into Druid:
>> > https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
>> > However, how many people are going to use this?
>> >
>> > For me, it makes more sense to have tooling around Druid, to slice and
>> > dice the data that you need, and do the ML stuff in sklearn, or even in
>> > Spark. For example, using https://github.com/druid-io/pydruid or having
>> > the ability to use Spark to read directly from deep storage.
>> >
>> > Introducing models using SPs or UDFs is also a possibility, but here I
>> > share Sayat's concerns when it comes to performance and scalability.
>> >
>> > Cheers, Fokko
>> >
>> >
>> >
>> > On Sat, 25 Jan 2020 at 08:51, Gaurav Bhatnagar <[email protected]> wrote:
>> >
>> > > +1
>> > >
>> > > Vertica has it. Good idea to introduce it in Druid.
>> > >
>> > > On Mon, Jan 13, 2020 at 12:52 AM Dusan Maric <[email protected]>
>> wrote:
>> > >
>> > > > +1
>> > > >
>> > > > That would be a great idea! Thanks for sharing this.
>> > > >
>> > > > Would just like to chime in on Druid + ML model cases: predictions
>> and
>> > > > anomaly detection on top of TensorFlow ❤
>> > > >
>> > > > Regards,
>> > > >
>> > > > On Fri, Jan 10, 2020 at 6:41 AM Roman Leventov <
>> [email protected]>
>> > > > wrote:
>> > > >
>> > > > > Hello Druid developers, what do you think about the future of
>> Druid &
>> > > > > machine learning?
>> > > > >
>> > > > > Druid has been great at complex aggregations. Could (should?) it make
>> > > > > inroads into ML? Perhaps aggregators which apply the rows against
>> > > > > some pre-trained model and summarize results.
>> > > > >
>> > > > > Should model training stay completely external to Druid, or could
>> > > > > it be incorporated into Druid's data lifecycle on a conceptual
>> > > > > level, such as a recurring "indexing" task which stores the result
>> > > > > (the model) in Druid's deep storage, with the model automatically
>> > > > > loaded on historical nodes as needed (just like segments) and
>> > > > > certain aggregators picking up the latest model?
>> > > > >
>> > > > > Does this make any sense? In what cases will Druid & ML work well
>> > > > > together, in what cases will they not, and when should ML stay
>> > > > > Spark's prerogative?
>> > > > >
>> > > > > I would be very interested to hear any thoughts on the topic,
>> vague
>> > > ideas
>> > > > > and questions.
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Dušan Marić
>> > > > mob.: +381 64 1124779 | e-mail: [email protected] | skype:
>> themaric
>> > > >
>> > >
>> >
>>
>
