Re: [DISCUSS] Opinionated Data Flows

Nick Allen Tue, 11 Oct 2016 10:50:38 -0700

oops, typo:

  A writer should not o̶c̶c̶u̶r̶  *care*  in which topology or for what
sensor type it is being used.
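
To make that concrete, here is a minimal sketch of a writer that knows nothing about topologies or sensor types.  Every name in it (Writer, InMemoryWriter, flush) is hypothetical and is not the existing Metron writer API; the caller alone decides the destination.

```python
from typing import Dict, List, Protocol

Message = Dict[str, object]


class Writer(Protocol):
    """Hypothetical interface: a writer sees only a destination and messages."""

    def write(self, destination: str, messages: List[Message]) -> None:
        ...


class InMemoryWriter:
    """Toy implementation that records what was written, keyed by destination."""

    def __init__(self) -> None:
        self.written: Dict[str, List[Message]] = {}

    def write(self, destination: str, messages: List[Message]) -> None:
        self.written.setdefault(destination, []).extend(messages)


def flush(writer: Writer, destination: str, batch: List[Message]) -> int:
    """Any caller (a parser flow, the Profiler, ...) can reuse the same writer."""
    writer.write(destination, batch)
    return len(batch)


writer = InMemoryWriter()
flush(writer, "profiles", [{"profile": "example", "value": 1}])
flush(writer, "bro-index", [{"ip_src_addr": "192.0.2.10"}])
```

Nothing in the writer refers to who is calling it, so the same object serves both callers.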


On Tue, Oct 11, 2016 at 1:46 PM, Nick Allen <n...@nickallen.org> wrote:

>
>> I disagree with the idea that Metron should not be responsible for
>> defining data flows and I think that conflicts with the idea of
>> abstracting out the CEP component (Storm, Flink, etc).
>
>
> When I say that a user should be able to define the data flow, I don't
> mean that in terms of the underlying implementation; aka topologies.  I
> mean that from a user's perspective.  A user should be able to define the
> sequence of validations, transformations, and enrichments that occur (or do
> not occur).
>
> Maybe I over-generalized in my rant around the data flow.  There are two
> concerns that led me to this idea of allowing a user to define the data
> flow.
>
>
> (1) The first is from the user's perspective.  Users need to have enough
> power and expressiveness to easily capture, transform, enrich and act on
> the data that exists in their environment.
>
> Another good concrete example of this popped up today.  Casey just opened
> METRON-496, which I believe also highlights the problem.
>
> *METRON-496: Field transformations are applied after validation, which
> means that the validation cannot be affected by the transformations.
> Consider a situation where you get a timestamp field in as a string and the
> parser validation expects a long.  Conversion could be done as part of a
> field transformation, whereas now it would fail validation.*
>
>
> Based on our current topology design, we have effectively "hard coded"
> that validations occur prior to transformations.  This effectively limits
> what a user can do.  How can we not do this to the user?  Isn't there some
> way that we can allow the user to define the sequence of transformations,
> validations, and enrichments?
>
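A rough sketch of what a user-defined ordering could look like, using the METRON-496 timestamp case.  This is only an illustration under assumed names (Step, run_flow, and the two step functions are hypothetical, not Metron code): the flow is an ordered list of steps, and the user picks the order.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
# A step returns the (possibly modified) message, or None to reject it.
Step = Callable[[Message], Optional[Message]]


def transform_timestamp(msg: Message) -> Optional[Message]:
    """Field transformation: convert a string timestamp to an integer."""
    out = dict(msg)
    out["timestamp"] = int(str(out["timestamp"]))
    return out


def validate_timestamp(msg: Message) -> Optional[Message]:
    """Validation: the timestamp must already be an integer."""
    return msg if isinstance(msg.get("timestamp"), int) else None


def run_flow(steps: List[Step], msg: Message) -> Optional[Message]:
    """Apply the steps in the order the user listed them."""
    current: Optional[Message] = msg
    for step in steps:
        current = step(current)
        if current is None:
            return None  # rejected by a validation step
    return current


raw = {"timestamp": "1476208238000"}
validate_first = run_flow([validate_timestamp, transform_timestamp], raw)
transform_first = run_flow([transform_timestamp, validate_timestamp], raw)
```

With validation first the message is rejected; with the transformation first it passes, which is the ordering METRON-496 asks for.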
>
> (2) My second concern is more from the developer's perspective.  Most of
> the functionality we have is in some way dependent on the topology that it
> is used in.  We have useful bits of functionality (think Stellar
> transforms, Geo enrichment, etc) that are closely coupled with our
> topologies.
>
> A good example of this is that I could not reuse the existing "writer"
> code base when implementing the Profiler.  The "writer" code base has lots
> of references to the topology and sensor type; concepts that do not exist
> to the Profiler.  This should all be factored out.  A writer should not
> occur in which topology or for what sensor type it is being used.
>
> Properly containing these concepts makes the code more reusable. An
> example of how this could look is the HBaseBolt and HBaseMapper in
> 'metron-hbase'.  This allows any topology to write data to HBase.  There is
> nothing in that code that ties it to a specific topology or sensor type.
>
>
>
>
>
> On Mon, Oct 10, 2016 at 12:49 PM, Ryan Merriman <merrim...@gmail.com>
> wrote:
>
>> I think this is a great discussion.  I especially like the DSL examples
>> that are given and think we should expand on that.  The good news is that
>> we are not far away from being able to actually implement it.  It's just a
>> matter of transforming that syntax into the zookeeper configs that drive
>> the topologies.  I think the underlying issue here is that the zookeeper
>> configs are not intuitive and are hard to work with.  Making them simpler
>> or adding a layer on top that makes them simpler is necessary in my
>> opinion.
>>
>> As for the edge cases that have come up and are mentioned in this
>> thread ("parse heterogenous data from a single topic" and "enriched
>> output to land in unique topics by sensor type"), a simple enhancement
>> could solve both of those.  Right now the output topic for parser and
>> enrichment topologies is either passed in when building the topology
>> (flux or constructor args) or retrieved from zookeeper.  This limits you
>> to 1 output topic per topology.  Expanding the KafkaWriter class to
>> optionally pull the output topic from a field in a parsed message or have
>> it passed in as an input parameter to the write method should make it
>> flexible enough to route messages to different topics.  Also this
>> statement is not entirely true:  "You cannot use the output of one
>> enrichment as the input to another".  You can if you use a Stellar
>> enrichment bolt and HBase enrichments.  Geo and host enrichments would
>> either need to be exposed through Stellar, or even better, converted to
>> HBase enrichments.
>>
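The routing idea above could be sketched roughly as follows.  This is not the KafkaWriter API; resolve_topic, topic_field, and the "output_topic" field name are assumptions made purely for illustration.

```python
from typing import Dict, Optional

Message = Dict[str, object]


def resolve_topic(message: Message,
                  default_topic: str,
                  topic_field: Optional[str] = None) -> str:
    """Pick the output topic for one message.

    If a topic field is configured and the message carries a non-empty
    string in it, route there; otherwise fall back to the topology-wide
    default topic.
    """
    if topic_field is not None:
        value = message.get(topic_field)
        if isinstance(value, str) and value:
            return value
    return default_topic


routed = resolve_topic({"output_topic": "bro-enriched"}, "enrichments", "output_topic")
fallback = resolve_topic({"ip_src_addr": "192.0.2.10"}, "enrichments", "output_topic")
```

Keeping the default means existing single-topic configurations would behave as they do today.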
>> I disagree with the idea that Metron should not be responsible for
>> defining data flows and I think that conflicts with the idea of
>> abstracting out the CEP component (Storm, Flink, etc).  There are
>> patterns that emerge and
>> tricks the community finds through experience that should be baked in.  An
>> example of this is the enrichment topologies.  Grouping messages together
>> by enrichment keys before enrichment allows us to put a caching layer in
>> front which lightens the load on HBase and makes enrichment more
>> efficient.  If we put the responsibility of defining topologies on the
>> user, now they have to be an expert in tuning whatever CEP is chosen as
>> well as be knowledgeable of established design patterns.  Maybe the current
>> state of Metron requires Storm tuning expertise anyways but I think we
>> should trend away from that and evolve Metron to be more capable of making
>> intelligent choices automatically.  I remember the early days of Hive
>> required careful consideration when writing queries to ensure the correct
>> joins were used, data was distributed evenly, etc.  Tuning Hive is easier
>> now because it has evolved to be able to make more of these choices
>> automatically without requiring users to have detailed knowledge of how
>> things work internally.
>>
>> Ryan Merriman
>>
>> On Fri, Oct 7, 2016 at 7:12 AM, Nick Allen <n...@nickallen.org> wrote:
>>
>> > Whether it is explicit or implicit, I think that would be one of the
>> > major benefits of having the expressiveness of a DSL.  I can choose to
>> > have some enrichments run in parallel (the split/join that you are
>> > referring to) or have some enrichments run serially.
>> >
>> > Having enrichments run serially is not something you can easily do with
>> > Metron today.  You cannot use the output of one enrichment as the input
>> > to another.
>> >
>> > As a simple example, I have a blacklist of countries for which my
>> > organization should not be doing business.  I need to use the IP to find
>> > the location and then use the location to match against a blacklist.  I
>> > need these enrichments to run serially.
>> >
>> > source("netflow")
>> >   -> parser("Netflow")
>> >   -> exists("ip_src_addr")
>> >   -> src_country = geo["ip_src_addr"].country
>> >   -> is_alert = blacklist["src_country"]
>> >   ...
>> >
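As a plain-code illustration of the serial case above (not Metron code; the lookup tables, the country codes "XX"/"YY", and the function name are all made-up placeholders), the second enrichment consumes the output of the first:

```python
from typing import Dict, Optional

Message = Dict[str, object]

# Hypothetical lookup data standing in for the geo and blacklist enrichments.
GEO: Dict[str, Dict[str, str]] = {
    "192.0.2.10": {"country": "XX"},
    "198.51.100.7": {"country": "YY"},
}
BLACKLIST = {"XX"}


def enrich_serially(msg: Message) -> Message:
    """Run geo first, then feed its result into the blacklist check."""
    out = dict(msg)
    geo: Optional[Dict[str, str]] = GEO.get(str(out.get("ip_src_addr")))
    out["src_country"] = geo["country"] if geo else None
    # This step depends on src_country, so it cannot run in parallel with geo.
    out["is_alert"] = out["src_country"] in BLACKLIST
    return out


flagged = enrich_serially({"ip_src_addr": "192.0.2.10"})
clean = enrich_serially({"ip_src_addr": "198.51.100.7"})
```

The dependency between the two steps is the reason a split/join over independent enrichments does not cover this case.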
>> >
>> >
>> >
>> > On Thu, Oct 6, 2016 at 6:25 PM, Matt Foley <mfo...@hortonworks.com>
>> wrote:
>> >
>> > > Would splitting and joining be implicit or explicit, for multi-path
>> > > topologies?
>> > > ________________________________________
>> > > From: zeo...@gmail.com <zeo...@gmail.com>
>> > > Sent: Thursday, October 06, 2016 11:03 AM
>> > > To: dev@metron.incubator.apache.org
>> > > Subject: Re: [DISCUSS] Opinionated Data Flows
>> > >
>> > > It should also be smart enough to handle an order like:
>> > >
>> > > source("bro")
>> > >   -> parser("BasicBroParser")
>> > >   -> exists("ip_src_addr")
>> > >   -> geo_ip_src = geo["ip_src_addr"]
>> > >   -> application = assets["ip_src_addr"].application
>> > >   -> owner = assets["ip_src_addr"].owner
>> > >   -> exists("ip_dst_addr")
>> > >   -> geo_ip_dst = geo["ip_dst_addr"]
>> > >   -> elasticsearch("bro-index")
>> > >
>> > > Without duplicate hits of the topologies.
>> > >
>> > > Jon
>> > >
>> > > On Thu, Oct 6, 2016 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:
>> > >
>> > > > Here is a quick example with some hypothetical syntax.  Whatever that
>> > > syntax
>> > > > might be, it would be very simple, easy to understand, and leverage
>> > > > high-level concepts specific to Metron.
>> > > >
>> > > > This flow consumes Bro data, ensures there are valid
>> source/destination
>> > > > IPs, performs geo-enrichment, asset enrichment and finally persists
>> the
>> > > > data in Elasticsearch.
>> > > >
>> > > >
>> > > > source("bro")
>> > > >   -> parser("BasicBroParser")
>> > > >   -> exists("ip_src_addr")
>> > > >   -> exists("ip_dst_addr")
>> > > >   -> geo_ip_src = geo["ip_src_addr"]
>> > > >   -> geo_ip_dst = geo["ip_dst_addr"]
>> > > >   -> application = assets["ip_src_addr"].application
>> > > >   -> owner = assets["ip_src_addr"].owner
>> > > >   -> elasticsearch("bro-index")
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen <n...@nickallen.org>
>> > wrote:
>> > > >
>> > > > > Chasing this bad idea down even further leads me to something even
>> > > > > crazier.
>> > > > >
>> > > > > Stellar 1.0 can only operate within a single topology and in most
>> > cases
>> > > > > only on a single message.  Stellar 2.0 could be the mechanism that
>> > > allows
>> > > > > users to define their own data flows and what "useful bits of
>> Metron
>> > > > > functionality" get plugged-in.
>> > > > >
>> > > > > Once you have a DSL that allows users to define what they want
>> > Metron
>> > > to
>> > > > > do, then the underlying implementation mechanism (which is
>> currently
>> > > > Storm)
>> > > > > can also be swapped-out.  If we have an even faster Storm
>> > > implementation,
>> > > > > then we swap in the Storm NG engine.  Maybe we want Metron to also
>> > run
>> > > in
>> > > > > Flink, then we just swap-in a Flink engine.
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen <n...@nickallen.org>
>> > > wrote:
>> > > > >
>> > > > >> I totally "bird dogged the previous thread" as Casey likes to
>> call
>> > it.
>> > > > :)
>> > > > >>  I am extracting this thought into a separate thread before I
>> start
>> > > > >> throwing out even more, crazier ideas.
>> > > > >>
>> > > > >> In general, Metron is very opinionated about data flows right
>> now.
>> > We
>> > > > >>> have Parser topologies that feed an Enrichment topology, which
>> then
>> > > > feeds
>> > > > >>> an Indexing topology.  We have useful bits of functionality
>> (think
>> > > > Stellar
>> > > > >>> transforms, Geo enrichment, etc) that are closely coupled with
>> > these
>> > > > >>> topologies (aka data flows).
>> > > > >>>
>> > > > >>
>> > > > >>
>> > > > >>> When a user wants to parse heterogenous data from a single
>> topic,
>> > > > that's
>> > > > >>> not easy.  When a user wants enriched output to land in unique
>> > topics
>> > > > by
>> > > > >>> sensor type, well, that's also not easy.    When a user wanted
>> to
>> > > skip
>> > > > >>> enrichment of data sources, we actually re-architected the data
>> > flow
>> > > > to add
>> > > > >>> the Indexing topology.
>> > > > >>>
>> > > > >>
>> > > > >>
>> > > > >>> In an ideal world, a user should be responsible for defining the
>> > data
>> > > > >>> flow, not Metron.  Metron should provide the "useful bits of
>> > > > functionality"
>> > > > >>> that a user can "plug in" wherever they like.  Metron itself
>> should
>> > > not
>> > > > care
>> > > > >>> how the data is moving or what step in the process it is at.
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Nick Allen <n...@nickallen.org>
>> > > > >>
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Nick Allen <n...@nickallen.org>
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Nick Allen <n...@nickallen.org>
>> > > >
>> > > --
>> > >
>> > > Jon
>> > >
>> >
>> >
>> >
>> > --
>> > Nick Allen <n...@nickallen.org>
>> >
>>
>
>
>
> --
> Nick Allen <n...@nickallen.org>
>



-- 
Nick Allen <n...@nickallen.org>
