Re: [DISCUSS] Opinionated Data Flows

2016-10-11 Thread Nick Allen
>> I disagree with the idea that Metron should not be responsible for
>> defining
>> data flows and I think that conflicts with the idea of abstracting out the
>> CEP component (Storm, Flink, etc).  There are patterns that emerge and
>> tricks the community finds through experience that should be baked in.  An
>> example of this is the enrichment topologies.  Grouping messages together
>> by enrichment keys before enrichment allows us to put a caching layer in
>> front which lightens the load on HBase and makes enrichment more
>> efficient.  If we put the responsibility of defining topologies on the
>> user, now they have to be an expert in tuning whatever CEP is chosen as
>> well as be knowledgeable of established design patterns.  Maybe the current
>> state of Metron requires Storm tuning expertise anyway but I think we
>> should trend away from that and evolve Metron to be more capable of making
>> intelligent choices automatically.  I remember the early days of Hive
>> required careful consideration when writing queries to ensure the correct
>> joins were used, data was distributed evenly, etc.  Tuning Hive is easier
>> now because it has evolved to be able to make more of these choices
>> automatically without requiring users to have detailed knowledge of how
>> things work internally.
>>
>> Ryan Merriman
>>
>> On Fri, Oct 7, 2016 at 7:12 AM, Nick Allen <n...@nickallen.org> wrote:
>>
>> > Whether it is explicit or implicit, I think that would be one of the
>> > major benefits of having the expressiveness of a DSL.  I can choose to
>> > have some enrichments run in parallel (the split/join that you are
>> > referring to) or have some enrichments run serially.
>> >
>> > Having enrichments run serially is not something you can easily do with
>> > Metron today.  You cannot use the output of one enrichment as the input
>> > to another.
>> >
>> > As a simple example, I have a blacklist of countries for which my
>> > organization should not be doing business.  I need to use the IP to find
>> > the location and then use the location to match against a blacklist.  I
>> > need these enrichments to run serially.
>> >
>> > source("netflow")
>> >   -> parser("Netflow")
>> >   -> exists("ip_src_addr")
>> >   -> src_country = geo["ip_src_addr"].country
>> >   -> is_alert = blacklist["src_country"]
>> >   ...
>> >
>> >
>> >
>> >
>> > On Thu, Oct 6, 2016 at 6:25 PM, Matt Foley <mfo...@hortonworks.com>
>> wrote:
>> >
>> > > Would splitting and joining be implicit or explicit, for multi-path
>> > > topologies?
>> > > 
>> > > From: zeo...@gmail.com <zeo...@gmail.com>
>> > > Sent: Thursday, October 06, 2016 11:03 AM
>> > > To: dev@metron.incubator.apache.org
>> > > Subject: Re: [DISCUSS] Opinionated Data Flows
>> > >
>> > > It should also be smart enough to handle an order like:
>> > >
>> > > source("bro")
>> > >   -> parser("BasicBroParser")
>> > >   -> exists("ip_src_addr")
>> > >   -> geo_ip_src = geo["ip_src_addr"]
>> > >   -> application = assets["ip_src_addr"].application
>> > >   -> owner = assets["ip_src_addr"].owner
>> > >   -> exists("ip_dst_addr")
>> > >   -> geo_ip_dst = geo["ip_dst_addr"]
>> > >   -> elasticsearch("bro-index")
>> > >
>> > > Without duplicate hits of the topologies.
>> > >
>> > > Jon
>> > >
>> > > On Thu, Oct 6, 2016 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:
>> > >
>> > > > Here is quick example with some hypothetical syntax.  Whatever that
>> > > syntax
>> > > > might be, it would be very simple, easy to understand, and leverage
>> > > > high-level concepts specific to Metron.
>> > > >
>> > > > This flow consumes Bro data, ensures there are valid
>> source/destination
>> > > > IPs, performs geo-enrichment, asset enrichment and finally persists
>> the
>> > > > data in Elasticsearch.
>> > > >
>> > > >
>> > > > source("bro")
>> > > >   -> parser("BasicBroParser")

Re: [DISCUSS] Opinionated Data Flows

2016-10-10 Thread Ryan Merriman
I think this is a great discussion.  I especially like the DSL examples
that are given and think we should expand on that.  The good news is that
we are not far away from being able to actually implement it.  It's just a
matter of transforming that syntax into the ZooKeeper configs that drive
the topologies.  I think the underlying issue here is that the ZooKeeper
configs are not intuitive and are hard to work with.  Making them simpler
or adding a layer on top that makes them simpler is necessary in my
opinion.

As for the edge cases that have come up and are mentioned in this thread
("parse heterogenous data from a single topic" and "enriched output to land
in unique topics by sensor type"), a simple enhancement could solve both of
those.  Right now the output topic for the parser and enrichment topologies
is either passed in when building the topology (Flux or constructor args) or
retrieved from ZooKeeper.  This limits you to one output topic per topology.
Expanding the KafkaWriter class to optionally pull the output topic from a
field in a parsed message or have it passed in as an input parameter to the
write method should make it flexible enough to route messages to different
topics.  Also this statement is not entirely true:  "You cannot use the
output of one enrichment as the input to another".  You can if you use a
Stellar enrichment bolt and HBase enrichments.  Geo and host enrichments
would either need to be exposed through Stellar, or even better, converted
to HBase enrichments.
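
A minimal sketch of the topic-routing idea, in Python rather than Metron's
Java and not taken from the actual KafkaWriter.  The field name
"output.topic" and the function resolve_topic are assumptions made up for
illustration; the point is only the order of precedence: an explicit
argument, then a field in the parsed message, then the topology-wide default.

```python
from typing import Mapping, Optional

# Hypothetical field name; the thread does not say which field would be used.
TOPIC_FIELD = "output.topic"


def resolve_topic(message: Mapping[str, object],
                  default_topic: str,
                  override: Optional[str] = None) -> str:
    """Pick the output topic for one parsed message.

    Precedence: explicit argument, then a field in the message, then the
    single topology-wide default (the behaviour described as "right now").
    """
    if override:
        return override
    from_message = message.get(TOPIC_FIELD)
    if isinstance(from_message, str) and from_message:
        return from_message
    return default_topic


bro = {"source.type": "bro", TOPIC_FIELD: "enriched-bro"}
snort = {"source.type": "snort"}

print(resolve_topic(bro, "enrichments"))    # topic taken from the message
print(resolve_topic(snort, "enrichments"))  # falls back to the default
```

With something like this, one topology could write to several topics
without any change to how the default topic is configured.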

I disagree with the idea that Metron should not be responsible for defining
data flows and I think that conflicts with the idea of abstracting out the
CEP component (Storm, Flink, etc).  There are patterns that emerge and
tricks the community finds through experience that should be baked in.  An
example of this is the enrichment topologies.  Grouping messages together
by enrichment keys before enrichment allows us to put a caching layer in
front which lightens the load on HBase and makes enrichment more
efficient.  If we put the responsibility of defining topologies on the
user, now they have to be an expert in tuning whatever CEP is chosen as
well as be knowledgeable of established design patterns.  Maybe the current
state of Metron requires Storm tuning expertise anyway but I think we
should trend away from that and evolve Metron to be more capable of making
intelligent choices automatically.  I remember the early days of Hive
required careful consideration when writing queries to ensure the correct
joins were used, data was distributed evenly, etc.  Tuning Hive is easier
now because it has evolved to be able to make more of these choices
automatically without requiring users to have detailed knowledge of how
things work internally.
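
To make the grouping-plus-cache pattern concrete, here is a rough Python
sketch.  It is not Metron code: lookup() is a stand-in for an HBase read,
and STATS only counts simulated round trips.  It shows that grouping
messages by enrichment key means one lookup per distinct key, and that a
cache in front keeps later batches from hitting the store again.

```python
from collections import defaultdict
from functools import lru_cache

STATS = {"lookups": 0}  # simulated round trips to the backing store


@lru_cache(maxsize=1024)
def lookup(key: str) -> str:
    """Stand-in for an HBase read; a real client call would go here."""
    STATS["lookups"] += 1
    return f"enrichment-for-{key}"


def enrich(messages, key_field="ip_src_addr"):
    """Group messages by enrichment key, then look each key up once."""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg[key_field]].append(msg)

    enriched = []
    for key, group in grouped.items():
        value = lookup(key)  # served from the cache on repeat keys
        for msg in group:
            enriched.append({**msg, "enrichment": value})
    return enriched


batch = [
    {"ip_src_addr": "10.0.0.1"},
    {"ip_src_addr": "10.0.0.2"},
    {"ip_src_addr": "10.0.0.1"},
]
first = enrich(batch)
second = enrich(batch)  # same keys again: no new lookups
print(STATS["lookups"])
```

This is the kind of pattern a user would otherwise have to rediscover
and tune for whichever engine is chosen.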

Ryan Merriman

On Fri, Oct 7, 2016 at 7:12 AM, Nick Allen <n...@nickallen.org> wrote:

> Whether it is explicit or implicit, I think that would be one of the major
> benefits of having the expressiveness of a DSL.  I can choose to have some
> enrichments run in parallel (the split/join that you are referring to) or
> have some enrichments run serially.
>
> Having enrichments run serially is not something you can easily do with
> Metron today.  You cannot use the output of one enrichment as the input to
> another.
>
> As a simple example, I have a blacklist of countries for which my
> organization should not be doing business.  I need to use the IP to find
> the location and then use the location to match against a blacklist.  I
> need these enrichments to run serially.
>
> source("netflow")
>   -> parser("Netflow")
>   -> exists("ip_src_addr")
>   -> src_country = geo["ip_src_addr"].country
>   -> is_alert = blacklist["src_country"]
>   ...
>
>
>
>
> On Thu, Oct 6, 2016 at 6:25 PM, Matt Foley <mfo...@hortonworks.com> wrote:
>
> > Would splitting and joining be implicit or explicit, for multi-path
> > topologies?
> > ________
> > From: zeo...@gmail.com <zeo...@gmail.com>
> > Sent: Thursday, October 06, 2016 11:03 AM
> > To: dev@metron.incubator.apache.org
> > Subject: Re: [DISCUSS] Opinionated Data Flows
> >
> > It should also be smart enough to handle an order like:
> >
> > source("bro")
> >   -> parser("BasicBroParser")
> >   -> exists("ip_src_addr")
> >   -> geo_ip_src = geo["ip_src_addr"]
> >   -> application = assets["ip_src_addr"].application
> >   -> owner = assets["ip_src_addr"].owner
> >   -> exists("ip_dst_addr")
> >   -> geo_ip_dst = geo["ip_dst_addr"]
> >   -> elasticsearch("bro-index")
> >
> > Without duplicate hits of the topologies.

Re: [DISCUSS] Opinionated Data Flows

2016-10-07 Thread Nick Allen
Whether it is explicit or implicit, I think that would be one of the major
benefits of having the expressiveness of a DSL.  I can choose to have some
enrichments run in parallel (the split/join that you are referring to) or
have some enrichments run serially.

Having enrichments run serially is not something you can easily do with
Metron today.  You cannot use the output of one enrichment as the input to
another.

As a simple example, I have a blacklist of countries for which my
organization should not be doing business.  I need to use the IP to find
the location and then use the location to match against a blacklist.  I
need these enrichments to run serially.

source("netflow")
  -> parser("Netflow")
  -> exists("ip_src_addr")
  -> src_country = geo["ip_src_addr"].country
  -> is_alert = blacklist["src_country"]
  ...
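
For illustration, the same serial flow written out as a small Python
sketch.  The GEO and BLACKLIST tables, the country codes and the
function name are invented for the example; the only point is that the
second enrichment reads the field written by the first.

```python
# Hypothetical lookup tables standing in for the geo and blacklist enrichments.
GEO = {
    "203.0.113.7": {"country": "AA"},
    "198.51.100.9": {"country": "BB"},
}
BLACKLIST = {"AA"}


def enrich_serially(message):
    """Run geo first, then feed its output into the blacklist check."""
    if "ip_src_addr" not in message:      # exists("ip_src_addr")
        return None
    out = dict(message)
    geo = GEO.get(out["ip_src_addr"], {})
    out["src_country"] = geo.get("country")            # geo[...].country
    out["is_alert"] = out["src_country"] in BLACKLIST  # blacklist[...]
    return out


print(enrich_serially({"ip_src_addr": "203.0.113.7"}))
print(enrich_serially({"ip_src_addr": "198.51.100.9"}))
```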




On Thu, Oct 6, 2016 at 6:25 PM, Matt Foley <mfo...@hortonworks.com> wrote:

> Would splitting and joining be implicit or explicit, for multi-path
> topologies?
> 
> From: zeo...@gmail.com <zeo...@gmail.com>
> Sent: Thursday, October 06, 2016 11:03 AM
> To: dev@metron.incubator.apache.org
> Subject: Re: [DISCUSS] Opinionated Data Flows
>
> It should also be smart enough to handle an order like:
>
> source("bro")
>   -> parser("BasicBroParser")
>   -> exists("ip_src_addr")
>   -> geo_ip_src = geo["ip_src_addr"]
>   -> application = assets["ip_src_addr"].application
>   -> owner = assets["ip_src_addr"].owner
>   -> exists("ip_dst_addr")
>   -> geo_ip_dst = geo["ip_dst_addr"]
>   -> elasticsearch("bro-index")
>
> Without duplicate hits of the topologies.
>
> Jon
>
> On Thu, Oct 6, 2016 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:
>
> > Here is a quick example with some hypothetical syntax.  Whatever that
> > syntax might be, it would be very simple, easy to understand, and leverage
> > high-level concepts specific to Metron.
> >
> > This flow consumes Bro data, ensures there are valid source/destination
> > IPs, performs geo-enrichment, asset enrichment and finally persists the
> > data in Elasticsearch.
> >
> >
> > source("bro")
> >   -> parser("BasicBroParser")
> >   -> exists("ip_src_addr")
> >   -> exists("ip_dst_addr")
> >   -> geo_ip_src = geo["ip_src_addr"]
> >   -> geo_ip_dst = geo["ip_dst_addr"]
> >   -> application = assets["ip_src_addr"].application
> >   -> owner = assets["ip_src_addr"].owner
> >   -> elasticsearch("bro-index")
> >
> >
> >
> >
> > On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen <n...@nickallen.org> wrote:
> >
> > > Chasing this bad idea down even further leads me to something even
> > > crazier.
> > >
> > > Stellar 1.0 can only operate within a single topology and in most cases
> > > only on a single message.  Stellar 2.0 could be the mechanism that
> allows
> > > users to define their own data flows and what "useful bits of Metron
> > > functionality" get plugged-in.
> > >
> > > Once, you have a DSL that allows users to define what they want Metron
> to
> > > do, then the underlying implementation mechanism (which is currently
> > Storm)
> > > can also be swapped-out.  If we have an even faster Storm
> implementation,
> > > then we swap in the Storm NG engine.  Maybe we want Metron to also run
> in
> > > Flink, then we just swap-in a Flink engine.
> > >
> > >
> > >
> > >
> > > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen <n...@nickallen.org>
> wrote:
> > >
> > >> I totally "bird dogged the previous thread" as Casey likes to call it.
> > :)
> > >>  I am extracting this thought into a separate thread before I start
> > >> throwing out even more, crazier ideas.
> > >>
> > >> In general, Metron is very opinionated about data flows right now.  We
> > >>> have Parser topologies that feed an Enrichment topology, which then
> > feeds
> > >>> an Indexing topology.  We have useful bits of functionality (think
> > Stellar
> > >>> transforms, Geo enrichment, etc) that are closely coupled with these
> > >>> topologies (aka data flows).
> > >>>
> > >>
> > >>
> > >>> When a user wants to parse heterogenous data from a single topic,
> > that's
> > >>> not easy.  When a user wants enriched output to land in unique topics by
> > >>> sensor type, well, that's also not easy.  When a user wanted to skip
> > >>> enrichment of data sources, we actually re-architected the data flow
> > to add
> > >>> the Indexing topology.
> > >>>
> > >>
> > >>
> > >>> In an ideal world, a user should be responsible for defining the data
> > >>> flow, not Metron.  Metron should provide the "useful bits of
> > functionality"
> > >>> that a user can "plugin" wherever they like.  Metron itself should
> not
> > care
> > >>> how the data is moving or what step in the process it is at.
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Nick Allen <n...@nickallen.org>
> > >>
> > >
> > >
> > >
> > > --
> > > Nick Allen <n...@nickallen.org>
> > >
> >
> >
> >
> > --
> > Nick Allen <n...@nickallen.org>
> >
> --
>
> Jon
>



-- 
Nick Allen <n...@nickallen.org>


Re: [DISCUSS] Opinionated Data Flows

2016-10-06 Thread zeo...@gmail.com
In this case I would initially think implicit to simplify the configs.
Doesn't seem overly complicated to implement in my mind, but that doesn't
mean I'm not missing something regarding the current state or future
roadmap.

Jon

On Thu, Oct 6, 2016, 18:25 Matt Foley <mfo...@hortonworks.com> wrote:

> Would splitting and joining be implicit or explicit, for multi-path
> topologies?
> 
> From: zeo...@gmail.com <zeo...@gmail.com>
> Sent: Thursday, October 06, 2016 11:03 AM
> To: dev@metron.incubator.apache.org
> Subject: Re: [DISCUSS] Opinionated Data Flows
>
> It should also be smart enough to handle an order like:
>
> source("bro")
>   -> parser("BasicBroParser")
>   -> exists("ip_src_addr")
>   -> geo_ip_src = geo["ip_src_addr"]
>   -> application = assets["ip_src_addr"].application
>   -> owner = assets["ip_src_addr"].owner
>   -> exists("ip_dst_addr")
>   -> geo_ip_dst = geo["ip_dst_addr"]
>   -> elasticsearch("bro-index")
>
> Without duplicate hits of the topologies.
>
> Jon
>
> On Thu, Oct 6, 2016 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:
>
> > Here is quick example with some hypothetical syntax.  Whatever that
> syntax
> > might be, it would be very simple, easy to understand, and leverage
> > high-level concepts specific to Metron.
> >
> > This flow consumes Bro data, ensures there are valid source/destination
> > IPs, performs geo-enrichment, asset enrichment and finally persists the
> > data in Elasticsearch.
> >
> >
> > source("bro")
> >   -> parser("BasicBroParser")
> >   -> exists("ip_src_addr")
> >   -> exists("ip_dst_addr")
> >   -> geo_ip_src = geo["ip_src_addr"]
> >   -> geo_ip_dst = geo["ip_dst_addr"]
> >   -> application = assets["ip_src_addr"].application
> >   -> owner = assets["ip_src_addr"].owner
> >   -> elasticsearch("bro-index")
> >
> >
> >
> >
> > On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen <n...@nickallen.org> wrote:
> >
> > > Chasing this bad idea down even further leads me to something even
> > > crazier.
> > >
> > > Stellar 1.0 can only operate within a single topology and in most cases
> > > only on a single message.  Stellar 2.0 could be the mechanism that
> allows
> > > users to define their own data flows and what "useful bits of Metron
> > > functionality" get plugged-in.
> > >
> > > Once, you have a DSL that allows users to define what they want Metron
> to
> > > do, then the underlying implementation mechanism (which is currently
> > Storm)
> > > can also be swapped-out.  If we have an even faster Storm
> implementation,
> > > then we swap in the Storm NG engine.  Maybe we want Metron to also run
> in
> > > Flink, then we just swap-in a Flink engine.
> > >
> > >
> > >
> > >
> > > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen <n...@nickallen.org>
> wrote:
> > >
> > >> I totally "bird dogged the previous thread" as Casey likes to call it.
> > :)
> > >>  I am extracting this thought into a separate thread before I start
> > >> throwing out even more, crazier ideas.
> > >>
> > >> In general, Metron is very opinionated about data flows right now.  We
> > >>> have Parser topologies that feed an Enrichment topology, which then
> > feeds
> > >>> an Indexing topology.  We have useful bits of functionality (think
> > Stellar
> > >>> transforms, Geo enrichment, etc) that are closely coupled with these
> > >>> topologies (aka data flows).
> > >>>
> > >>
> > >>
> > >>> When a user wants to parse heterogenous data from a single topic,
> > that's
> > >>> not easy.  When a user wants enriched output to land in unique topics by
> > >>> sensor type, well, that's also not easy.  When a user wanted to skip
> > >>> enrichment of data sources, we actually re-architected the data flow
> > to add
> > >>> the Indexing topology.
> > >>>
> > >>
> > >>
> > >>> In an ideal world, a user should be responsible for defining the data
> > >>> flow, not Metron.  Metron should provide the "useful bits of
> > functionality"
> > >>> that a user can "plugin" wherever they like.  Metron itself should
> not
> > care
> > >>> how the data is moving or what step in the process it is at.
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Nick Allen <n...@nickallen.org>
> > >>
> > >
> > >
> > >
> > > --
> > > Nick Allen <n...@nickallen.org>
> > >
> >
> >
> >
> > --
> > Nick Allen <n...@nickallen.org>
> >
> --
>
> Jon
>
-- 

Jon


Re: [DISCUSS] Opinionated Data Flows

2016-10-06 Thread zeo...@gmail.com
It should also be smart enough to handle an order like:

source("bro")
  -> parser("BasicBroParser")
  -> exists("ip_src_addr")
  -> geo_ip_src = geo["ip_src_addr"]
  -> application = assets["ip_src_addr"].application
  -> owner = assets["ip_src_addr"].owner
  -> exists("ip_dst_addr")
  -> geo_ip_dst = geo["ip_dst_addr"]
  -> elasticsearch("bro-index")

Without duplicate hits of the topologies.
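
One way to read "without duplicate hits" is that steps written in any
order get regrouped so each enrichment is visited once.  A rough Python
sketch of that regrouping follows; the tuple layout and plan_lookups are
assumptions for illustration, not an existing Metron API.

```python
# Each step: (target field, enrichment source, key field, attribute or None).
# This mirrors the interleaved order in the flow above.
STEPS = [
    ("geo_ip_src", "geo", "ip_src_addr", None),
    ("application", "assets", "ip_src_addr", "application"),
    ("owner", "assets", "ip_src_addr", "owner"),
    ("geo_ip_dst", "geo", "ip_dst_addr", None),
]


def plan_lookups(steps):
    """Group steps so each (source, key field) pair is looked up only once."""
    plan = {}
    for target, source, key_field, attr in steps:
        plan.setdefault((source, key_field), []).append((target, attr))
    return plan


plan = plan_lookups(STEPS)
# Distinct enrichment sources in first-seen order; each is visited once.
stages = list(dict.fromkeys(source for source, _ in plan))

for (source, key_field), outputs in plan.items():
    print(source, key_field, outputs)
print(stages)
```

Here four steps collapse into three lookups across two enrichment
stages, regardless of the order the user wrote them in.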

Jon

On Thu, Oct 6, 2016 at 1:55 PM Nick Allen  wrote:

> Here is quick example with some hypothetical syntax.  Whatever that syntax
> might be, it would be very simple, easy to understand, and leverage
> high-level concepts specific to Metron.
>
> This flow consumes Bro data, ensures there are valid source/destination
> IPs, performs geo-enrichment, asset enrichment and finally persists the
> data in Elasticsearch.
>
>
> source("bro")
>   -> parser("BasicBroParser")
>   -> exists("ip_src_addr")
>   -> exists("ip_dst_addr")
>   -> geo_ip_src = geo["ip_src_addr"]
>   -> geo_ip_dst = geo["ip_dst_addr"]
>   -> application = assets["ip_src_addr"].application
>   -> owner = assets["ip_src_addr"].owner
>   -> elasticsearch("bro-index")
>
>
>
>
> On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen  wrote:
>
> > Chasing this bad idea down even further leads me to something even
> > crazier.
> >
> > Stellar 1.0 can only operate within a single topology and in most cases
> > only on a single message.  Stellar 2.0 could be the mechanism that allows
> > users to define their own data flows and what "useful bits of Metron
> > functionality" get plugged-in.
> >
> > Once, you have a DSL that allows users to define what they want Metron to
> > do, then the underlying implementation mechanism (which is currently
> Storm)
> > can also be swapped-out.  If we have an even faster Storm implementation,
> > then we swap in the Storm NG engine.  Maybe we want Metron to also run in
> > Flink, then we just swap-in a Flink engine.
> >
> >
> >
> >
> > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen  wrote:
> >
> >> I totally "bird dogged the previous thread" as Casey likes to call it.
> :)
> >>  I am extracting this thought into a separate thread before I start
> >> throwing out even more, crazier ideas.
> >>
> >> In general, Metron is very opinionated about data flows right now.  We
> >>> have Parser topologies that feed an Enrichment topology, which then
> feeds
> >>> an Indexing topology.  We have useful bits of functionality (think
> Stellar
> >>> transforms, Geo enrichment, etc) that are closely coupled with these
> >>> topologies (aka data flows).
> >>>
> >>
> >>
> >>> When a user wants to parse heterogenous data from a single topic,
> that's
> >>> not easy.  When a user wants enriched output to land in unique topics by
> >>> sensor type, well, that's also not easy.  When a user wanted to skip
> >>> enrichment of data sources, we actually re-architected the data flow
> to add
> >>> the Indexing topology.
> >>>
> >>
> >>
> >>> In an ideal world, a user should be responsible for defining the data
> >>> flow, not Metron.  Metron should provide the "useful bits of
> functionality"
> >>> that a user can "plugin" wherever they like.  Metron itself should not
> care
> >>> how the data is moving or what step in the process it is at.
> >>
> >>
> >>
> >>
> >> --
> >> Nick Allen 
> >>
> >
> >
> >
> > --
> > Nick Allen 
> >
>
>
>
> --
> Nick Allen 
>
-- 

Jon


Re: [DISCUSS] Opinionated Data Flows

2016-10-06 Thread Nick Allen
Here is a quick example with some hypothetical syntax.  Whatever that syntax
might be, it would be very simple, easy to understand, and leverage
high-level concepts specific to Metron.

This flow consumes Bro data, ensures there are valid source/destination
IPs, performs geo-enrichment, asset enrichment and finally persists the
data in Elasticsearch.


source("bro")
  -> parser("BasicBroParser")
  -> exists("ip_src_addr")
  -> exists("ip_dst_addr")
  -> geo_ip_src = geo["ip_src_addr"]
  -> geo_ip_dst = geo["ip_dst_addr"]
  -> application = assets["ip_src_addr"].application
  -> owner = assets["ip_src_addr"].owner
  -> elasticsearch("bro-index")
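
To give a feel for how little machinery this hypothetical syntax would
need, here is a small Python sketch that turns such a flow into a list
of steps.  The regular expressions and the step dictionaries are
invented for the example; no real grammar has been agreed on.

```python
import re

FLOW = '''
source("bro")
  -> parser("BasicBroParser")
  -> exists("ip_src_addr")
  -> geo_ip_src = geo["ip_src_addr"]
  -> owner = assets["ip_src_addr"].owner
  -> elasticsearch("bro-index")
'''

# name("arg"), e.g. source("bro")
CALL = re.compile(r'^(\w+)\("([^"]*)"\)$')
# target = source["key"] with an optional .attribute
ASSIGN = re.compile(r'^(\w+)\s*=\s*(\w+)\["([^"]*)"\](?:\.(\w+))?$')


def parse_flow(text):
    """Parse the hypothetical flow syntax into a list of step dicts."""
    steps = []
    for raw in text.strip().splitlines():
        line = raw.strip()
        if line.startswith("->"):
            line = line[2:].strip()
        call = CALL.match(line)
        if call:
            steps.append({"op": call.group(1), "arg": call.group(2)})
            continue
        assign = ASSIGN.match(line)
        if assign:
            target, source, key, attr = assign.groups()
            steps.append({"op": "enrich", "target": target,
                          "source": source, "key": key, "attr": attr})
            continue
        raise ValueError(f"unrecognised step: {raw!r}")
    return steps


steps = parse_flow(FLOW)
for step in steps:
    print(step)
```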




On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen  wrote:

> Chasing this bad idea down even further leads me to something even
> crazier.
>
> Stellar 1.0 can only operate within a single topology and in most cases
> only on a single message.  Stellar 2.0 could be the mechanism that allows
> users to define their own data flows and what "useful bits of Metron
> functionality" get plugged-in.
>
> Once, you have a DSL that allows users to define what they want Metron to
> do, then the underlying implementation mechanism (which is currently Storm)
> can also be swapped-out.  If we have an even faster Storm implementation,
> then we swap in the Storm NG engine.  Maybe we want Metron to also run in
> Flink, then we just swap-in a Flink engine.
>
>
>
>
> On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen  wrote:
>
>> I totally "bird dogged the previous thread" as Casey likes to call it. :)
>>  I am extracting this thought into a separate thread before I start
>> throwing out even more, crazier ideas.
>>
>> In general, Metron is very opinionated about data flows right now.  We
>>> have Parser topologies that feed an Enrichment topology, which then feeds
>>> an Indexing topology.  We have useful bits of functionality (think Stellar
>>> transforms, Geo enrichment, etc) that are closely coupled with these
>>> topologies (aka data flows).
>>>
>>
>>
>>> When a user wants to parse heterogenous data from a single topic, that's
>>> not easy.  When a user wants enriched output to land in unique topics by
>>> sensor type, well, that's also not easy.  When a user wanted to skip
>>> enrichment of data sources, we actually re-architected the data flow to add
>>> the Indexing topology.
>>>
>>
>>
>>> In an ideal world, a user should be responsible for defining the data
>>> flow, not Metron.  Metron should provide the "useful bits of functionality"
>>> that a user can "plugin" wherever they like.  Metron itself should not care
>>> how the data is moving or what step in the process it is at.
>>
>>
>>
>>
>> --
>> Nick Allen 
>>
>
>
>
> --
> Nick Allen 
>



-- 
Nick Allen 


Re: [DISCUSS] Opinionated Data Flows

2016-10-06 Thread Nick Allen
Personally, I was seeing METRON-477 as one of those "useful bits of
functionality" that would be orchestrated by Stellar 2.0.  But I can also
see your viewpoint on how it could also be part of the orchestration.  Very
interesting.



On Thu, Oct 6, 2016 at 1:09 PM, zeo...@gmail.com  wrote:

> One of those users gives this a +1.  This also appears related to
> METRON-477, except that 477 is more
> focused on data flow once it hits disk and this is during ingest/stream
> processing.  At the end of the day, not that different IMO.  Would love to
> see it all managed via Stellar/zookeeper.
>
> Jon
>
> On Thu, Oct 6, 2016 at 1:00 PM Nick Allen  wrote:
>
> In reality, the current "engine" is Storm + Kafka + HBase.  Each of these
> could be independently swapped out once Metron is just a DSL with multiple
> underlying engines.
>
> Ok, I'll stop.
>
> On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen  wrote:
>
> > Chasing this bad idea down even further leads me to something even
> > crazier.
> >
> > Stellar 1.0 can only operate within a single topology and in most cases
> > only on a single message.  Stellar 2.0 could be the mechanism that allows
> > users to define their own data flows and what "useful bits of Metron
> > functionality" get plugged-in.
> >
> > Once, you have a DSL that allows users to define what they want Metron to
> > do, then the underlying implementation mechanism (which is currently
> Storm)
> > can also be swapped-out.  If we have an even faster Storm implementation,
> > then we swap in the Storm NG engine.  Maybe we want Metron to also run in
> > Flink, then we just swap-in a Flink engine.
> >
> >
> >
> >
> > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen  wrote:
> >
> >> I totally "bird dogged the previous thread" as Casey likes to call it.
> :)
> >>  I am extracting this thought into a separate thread before I start
> >> throwing out even more, crazier ideas.
> >>
> >> In general, Metron is very opinionated about data flows right now.  We
> >>> have Parser topologies that feed an Enrichment topology, which then
> feeds
> >>> an Indexing topology.  We have useful bits of functionality (think
> Stellar
> >>> transforms, Geo enrichment, etc) that are closely coupled with these
> >>> topologies (aka data flows).
> >>>
> >>
> >>
> >>> When a user wants to parse heterogenous data from a single topic,
> that's
> >>> not easy.  When a user wants enriched output to land in unique topics by
> >>> sensor type, well, that's also not easy.  When a user wanted to skip
> >>> enrichment of data sources, we actually re-architected the data flow to
> add
> >>> the Indexing topology.
> >>>
> >>
> >>
> >>> In an ideal world, a user should be responsible for defining the data
> >>> flow, not Metron.  Metron should provide the "useful bits of
> functionality"
> >>> that a user can "plugin" wherever they like.  Metron itself should not
> care
> >>> how the data is moving or what step in the process it is at.
> >>
> >>
> >>
> >>
> >> --
> >> Nick Allen 
> >>
> >
> >
> >
> > --
> > Nick Allen 
> >
>
>
>
> --
> Nick Allen 
>
> --
>
> Jon
>



-- 
Nick Allen 


Re: [DISCUSS] Opinionated Data Flows

2016-10-06 Thread zeo...@gmail.com
One of those users gives this a +1.  This also appears related to
METRON-477, except that 477 is more
focused on data flow once it hits disk and this is during ingest/stream
processing.  At the end of the day, not that different IMO.  Would love to
see it all managed via Stellar/zookeeper.

Jon

On Thu, Oct 6, 2016 at 1:00 PM Nick Allen  wrote:

In reality, the current "engine" is Storm + Kafka + HBase.  Each of these
could be independently swapped out once Metron is just a DSL with multiple
underlying engines.

Ok, I'll stop.

On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen  wrote:

> Chasing this bad idea down even further leads me to something even
> crazier.
>
> Stellar 1.0 can only operate within a single topology and in most cases
> only on a single message.  Stellar 2.0 could be the mechanism that allows
> users to define their own data flows and what "useful bits of Metron
> functionality" get plugged-in.
>
> Once, you have a DSL that allows users to define what they want Metron to
> do, then the underlying implementation mechanism (which is currently
Storm)
> can also be swapped-out.  If we have an even faster Storm implementation,
> then we swap in the Storm NG engine.  Maybe we want Metron to also run in
> Flink, then we just swap-in a Flink engine.
>
>
>
>
> On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen  wrote:
>
>> I totally "bird dogged the previous thread" as Casey likes to call it. :)
>>  I am extracting this thought into a separate thread before I start
>> throwing out even more, crazier ideas.
>>
>> In general, Metron is very opinionated about data flows right now.  We
>>> have Parser topologies that feed an Enrichment topology, which then
feeds
>>> an Indexing topology.  We have useful bits of functionality (think
Stellar
>>> transforms, Geo enrichment, etc) that are closely coupled with these
>>> topologies (aka data flows).
>>>
>>
>>
>>> When a user wants to parse heterogenous data from a single topic, that's
>>> not easy.  When a user wants enriched output to land in unique topics by
>>> sensor type, well, that's also not easy.  When a user wanted to skip
>>> enrichment of data sources, we actually re-architected the data flow to
add
>>> the Indexing topology.
>>>
>>
>>
>>> In an ideal world, a user should be responsible for defining the data
>>> flow, not Metron.  Metron should provide the "useful bits of
functionality"
>>> that a user can "plugin" wherever they like.  Metron itself should not
care
>>> how the data is moving or what step in the process it is at.
>>
>>
>>
>>
>> --
>> Nick Allen 
>>
>
>
>
> --
> Nick Allen 
>



--
Nick Allen 

-- 

Jon


Re: [DISCUSS] Opinionated Data Flows

2016-10-06 Thread Nick Allen
In reality, the current "engine" is Storm + Kafka + HBase.  Each of these
could be independently swapped out once Metron is just a DSL with multiple
underlying engines.

Ok, I'll stop.

On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen  wrote:

> Chasing this bad idea down even further leads me to something even
> crazier.
>
> Stellar 1.0 can only operate within a single topology and in most cases
> only on a single message.  Stellar 2.0 could be the mechanism that allows
> users to define their own data flows and what "useful bits of Metron
> functionality" get plugged-in.
>
> Once, you have a DSL that allows users to define what they want Metron to
> do, then the underlying implementation mechanism (which is currently Storm)
> can also be swapped-out.  If we have an even faster Storm implementation,
> then we swap in the Storm NG engine.  Maybe we want Metron to also run in
> Flink, then we just swap-in a Flink engine.
>
>
>
>
> On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen  wrote:
>
>> I totally "bird dogged the previous thread" as Casey likes to call it. :)
>>  I am extracting this thought into a separate thread before I start
>> throwing out even more, crazier ideas.
>>
>> In general, Metron is very opinionated about data flows right now.  We
>>> have Parser topologies that feed an Enrichment topology, which then feeds
>>> an Indexing topology.  We have useful bits of functionality (think Stellar
>>> transforms, Geo enrichment, etc) that are closely coupled with these
>>> topologies (aka data flows).
>>>
>>
>>
>>> When a user wants to parse heterogenous data from a single topic, that's
>>> not easy.  When a user wants enriched output to land in unique topics by
>>> sensor type, well, that's also not easy.  When a user wanted to skip
>>> enrichment of data sources, we actually re-architected the data flow to add
>>> the Indexing topology.
>>>
>>
>>
>>> In an ideal world, a user should be responsible for defining the data
>>> flow, not Metron.  Metron should provide the "useful bits of functionality"
>>> that a user can "plugin" wherever they like.  Metron itself should not care
>>> how the data is moving or what step in the process it is at.
>>
>>
>>
>>
>> --
>> Nick Allen 
>>
>
>
>
> --
> Nick Allen 
>



-- 
Nick Allen 


Re: [DISCUSS] Opinionated Data Flows

2016-10-06 Thread Nick Allen
Chasing this bad idea down even further leads me to something even crazier.

Stellar 1.0 can only operate within a single topology and in most cases
only on a single message.  Stellar 2.0 could be the mechanism that allows
users to define their own data flows and what "useful bits of Metron
functionality" get plugged in.

Once you have a DSL that allows users to define what they want Metron to
do, the underlying implementation mechanism (which is currently Storm)
can also be swapped out.  If we have an even faster Storm implementation,
then we swap in the Storm NG engine.  If we want Metron to also run in
Flink, then we just swap in a Flink engine.
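
A toy Python sketch of what "swapping the engine" could look like.  The
Engine protocol and both engine classes are made up for illustration
and run in-process; a Storm or Flink engine would have to translate the
same list of steps into its own topology or job instead.

```python
from typing import Callable, Dict, Iterable, List, Protocol

Message = Dict[str, object]
Step = Callable[[Message], Message]


class Engine(Protocol):
    """Anything that can run a list of steps over a stream of messages."""
    def run(self, steps: List[Step], messages: Iterable[Message]) -> List[Message]: ...


class PerMessageEngine:
    """Toy engine: pushes each message through every step in order."""
    def run(self, steps, messages):
        results = []
        for msg in messages:
            for step in steps:
                msg = step(msg)
            results.append(msg)
        return results


class PerStepEngine:
    """Toy engine: applies each step to the whole batch before the next."""
    def run(self, steps, messages):
        batch = list(messages)
        for step in steps:
            batch = [step(msg) for msg in batch]
        return batch


def tag_source(msg):
    return {**msg, "source.type": "bro"}


def mark_seen(msg):
    return {**msg, "seen": True}


flow = [tag_source, mark_seen]           # the user-defined data flow
data = [{"ip_src_addr": "10.0.0.1"}]

# The flow definition stays the same; only the engine changes.
print(PerMessageEngine().run(flow, data))
print(PerStepEngine().run(flow, data))
```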




On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen  wrote:

> I totally "bird dogged the previous thread" as Casey likes to call it. :)
>  I am extracting this thought into a separate thread before I start
> throwing out even more, crazier ideas.
>
> In general, Metron is very opinionated about data flows right now.  We
>> have Parser topologies that feed an Enrichment topology, which then feeds
>> an Indexing topology.  We have useful bits of functionality (think Stellar
>> transforms, Geo enrichment, etc) that are closely coupled with these
>> topologies (aka data flows).
>>
>
>
>> When a user wants to parse heterogenous data from a single topic, that's
>> not easy.  When a user wants enriched output to land in unique topics by
>> sensor type, well, that's also not easy.  When a user wanted to skip
>> enrichment of data sources, we actually re-architected the data flow to add
>> the Indexing topology.
>>
>
>
>> In an ideal world, a user should be responsible for defining the data
>> flow, not Metron.  Metron should provide the "useful bits of functionality"
>> that a user can "plugin" wherever they like.  Metron itself should not care
>> how the data is moving or what step in the process it is at.
>
>
>
>
> --
> Nick Allen 
>



-- 
Nick Allen