oops, typo: A writer should not o̶c̶c̶u̶r̶ *care* in which topology or for what sensor type it is being used.
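
To make that concrete, here is a minimal sketch of a writer that is handed everything it needs by its caller and has no notion of a topology or sensor type. This is not Metron's writer code; `Writer`, `InMemorySink`, and `row_key` are hypothetical names made up for illustration, and the in-memory sink merely stands in for a real destination such as an HBase table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, object]


@dataclass
class InMemorySink:
    """Hypothetical stand-in for a real destination (e.g. an HBase table)."""
    rows: Dict[str, Message] = field(default_factory=dict)

    def put(self, key: str, value: Message) -> None:
        self.rows[key] = value


@dataclass
class Writer:
    """Writes messages to a sink; holds no topology or sensor-type state."""
    sink: InMemorySink
    row_key: Callable[[Message], str]  # the caller decides how a message maps to a key

    def write(self, messages: List[Message]) -> int:
        for message in messages:
            self.sink.put(self.row_key(message), message)
        return len(messages)


# Two unrelated callers (parser-style and Profiler-style) reuse the same Writer.
sink = InMemorySink()
parser_writer = Writer(sink, row_key=lambda m: f"{m['ip_src_addr']}:{m['timestamp']}")
profiler_writer = Writer(sink, row_key=lambda m: f"{m['profile']}:{m['period']}")

parser_writer.write([{"ip_src_addr": "10.0.0.1", "timestamp": 1476207960}])
profiler_writer.write([{"profile": "bytes_out", "period": 42, "value": 1024}])
```

The only thing that differs between the two callers is the mapping function they pass in, which is roughly the split that HBaseBolt and HBaseMapper make in 'metron-hbase'.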

On Tue, Oct 11, 2016 at 1:46 PM, Nick Allen <n...@nickallen.org> wrote:

>> I disagree with the idea that Metron should not be responsible for defining
>> data flows and I think that conflicts with the idea of abstracting out the
>> CEP component (Storm, Flink, etc).
>
> When I say that a user should be able to define the data flow, I don't
> mean that in terms of the underlying implementation; aka topologies. I
> mean that from a user's perspective. A user should be able to define the
> sequence of validations, transformation, and enrichments that occur (or do
> not occur).
>
> Maybe I over-generalized in my rant around the data flow. There are two
> concerns that led me to this idea of allowing a user to define the data
> flow.
>
> (1) The first is from the user's perspective. Users need to have enough
> power and expressiveness to easily capture, transform, enrich and act on
> the data that exists in their environment.
>
> Another good concrete example of this popped up today. Casey just opened
> METRON-496, that I believe also highlights the problem.
>
> *METRON-496: Field transformations are applied after validation, which
> means that the validation cannot be affected by the transformations.
> Consider a situation where you get a timestamp field in as a string and the
> parser validation expects a long. Conversion could be done as part of a
> field transformation, whereas now it would fail validation.*
>
> Based on our current topology design, we have effectively "hard coded"
> that validations occur prior to transformations. This effectively limits
> what a user can do. How can we not do this to the user? Isn't there some
> way that we can allow the user to define the sequence of transformations,
> validations, and enrichments?
>
> (2) My second concern is more from the developer's perspective. Most of
> the functionality we have, is in some way dependent on the topology that it
> is used in.
> We have useful bits of functionality (think Stellar
> transforms, Geo enrichment, etc) that are closely coupled with our
> topologies.
>
> A good example of this being that I could not reuse the existing "writer"
> code base when implementing the Profiler. The "writer" code base has lots
> of references to the topology and sensor type; concepts that do not exist
> to the Profiler. This should all be factored out. A writer should not
> occur in which topology or for what sensor type it is being used.
>
> Properly containing these concepts makes the code more reusable. An
> example of how this could look is the HBaseBolt and HBaseMapper in
> 'metron-hbase'. This allows any topology to write data to HBase. There is
> nothing in that code that ties it to a specific topology or sensor type.
>
> On Mon, Oct 10, 2016 at 12:49 PM, Ryan Merriman <merrim...@gmail.com> wrote:
>
>> I think this is a great discussion. I especially like the DSL examples
>> that are given and think we should expand on that. The good news is that
>> we are not far away from being able to actually implement it. It's just a
>> matter of transforming that syntax into the zookeeper configs that drive
>> the topologies. I think the underlying issue here is that the zookeeper
>> configs are not intuitive and are hard to work with. Making them simpler
>> or adding a layer on top that makes them simpler is necessary in my
>> opinion.
>>
>> As for the edge cases that have come up and are mentioned in this
>> thread ("parse heterogenous data from a single topic" and "enriched output
>> to land in unique topics by sensor type"), a simple enhancement could solve
>> both of those. Right now the output topic for parser and enrichment
>> topologies are either passed in when building the topology (flux or
>> constructor args) or retrieved from zookeeper. This limits you to 1 output
>> topic per topology.
>> Expanding the KafkaWriter class to optionally pull the output topic from a
>> field in a parsed message or have it passed in as an input parameter to the
>> write method should make it flexible enough to route messages to different
>> topics. Also this statement is not entirely true: "You cannot use the
>> output of one enrichment as the input to another". You can if you use a
>> Stellar enrichment bolt and HBase enrichments. Geo and host enrichments
>> would either need to be exposed through Stellar, or even better, converted
>> to HBase enrichments.
>>
>> I disagree with the idea that Metron should not be responsible for defining
>> data flows and I think that conflicts with the idea of abstracting out the
>> CEP component (Storm, Flink, etc). There are patterns that emerge and
>> tricks the community finds through experience that should be baked in. An
>> example of this is the enrichment topologies. Grouping messages together
>> by enrichment keys before enrichment allows us to put a caching layer in
>> front which lightens the load on HBase and makes enrichment more
>> efficient. If we put the responsibility of defining topologies on the
>> user, now they have to be an expert in tuning whatever CEP is chosen as
>> well as be knowledgable of established design patterns. Maybe the current
>> state of Metron requires Storm tuning expertise anyways but I think we
>> should trend away from that and evolve Metron to be more capable of making
>> intelligent choices automatically. I remember the early days of Hive
>> required careful consideration when writing queries to ensure the correct
>> joins where used, data was distributed evenly, etc. Tuning Hive is easier
>> now because it has evolved to be able to make more of these choices
>> automatically without requiring users to have detailed knowledge of how
>> things work internally.
>>
>> Ryan Merriman
>>
>> On Fri, Oct 7, 2016 at 7:12 AM, Nick Allen <n...@nickallen.org> wrote:
>>
>> > Whether it is explicit or implicit, I think that would be one of the major
>> > benefits of having the expressiveness of a DSL. I can choose to have some
>> > enrichments run in parallel (the split/join that you are referring to) or
>> > have some enrichment runs serially.
>> >
>> > Having enrichments run serially is not something you can easily do with
>> > Metron today. You cannot use the output of one enrichment as the input to
>> > another.
>> >
>> > As a simple example, I have a blacklist of countries for which my
>> > organization should not be doing business. I need to use the IP to find
>> > the location and then use the location to match against a blacklist. I
>> > need these enrichments to run serially.
>> >
>> > source("netflow")
>> >   -> parser("Netflow")
>> >   -> exists("ip_src_addr")
>> >   -> src_country = geo["ip_src_addr"].country
>> >   -> is_alert = blacklist["src_country"]
>> >   ...
>> >
>> > On Thu, Oct 6, 2016 at 6:25 PM, Matt Foley <mfo...@hortonworks.com> wrote:
>> >
>> > > Would splitting and joining be implicit or explicit, for multi-path
>> > > topologies?
>> > > ________________________________________
>> > > From: zeo...@gmail.com <zeo...@gmail.com>
>> > > Sent: Thursday, October 06, 2016 11:03 AM
>> > > To: dev@metron.incubator.apache.org
>> > > Subject: Re: [DISCUSS] Opinionated Data Flows
>> > >
>> > > It should also be smart enough to handle an order like:
>> > >
>> > > source("bro")
>> > >   -> parser("BasicBroParser")
>> > >   -> exists("ip_src_addr")
>> > >   -> geo_ip_src = geo["ip_src_addr"]
>> > >   -> application = assets["ip_src_addr"].application
>> > >   -> owner = assets["ip_src_addr"].owner
>> > >   -> exists("ip_dst_addr")
>> > >   -> geo_ip_dst = geo["ip_dst_addr"]
>> > >   -> elasticsearch("bro-index")
>> > >
>> > > Without duplicate hits of the topologies.
>> > >
>> > > Jon
>> > >
>> > > On Thu, Oct 6, 2016 at 1:55 PM Nick Allen <n...@nickallen.org> wrote:
>> > >
>> > > > Here is quick example with some hypothetical syntax. Whatever that syntax
>> > > > might be, it would be very simple, easy to understand, and leverage
>> > > > high-level concepts specific to Metron.
>> > > >
>> > > > This flow consumes Bro data, ensures there are valid source/destination
>> > > > IPs, performs geo-enrichment, asset enrichment and finally persists the
>> > > > data in Elasticsearch.
>> > > >
>> > > > source("bro")
>> > > >   -> parser("BasicBroParser")
>> > > >   -> exists("ip_src_addr")
>> > > >   -> exists("ip_dst_addr")
>> > > >   -> geo_ip_src = geo["ip_src_addr"]
>> > > >   -> geo_ip_dst = geo["ip_dst_addr"]
>> > > >   -> application = assets["ip_src_addr"].application
>> > > >   -> owner = assets["ip_src_addr"].owner
>> > > >   -> elasticsearch("bro-index")
>> > > >
>> > > > On Thu, Oct 6, 2016 at 12:58 PM, Nick Allen <n...@nickallen.org> wrote:
>> > > >
>> > > > > Chasing this bad idea down even further leads me to something even
>> > > > > crazier.
>> > > > >
>> > > > > Stellar 1.0 can only operate within a single topology and in most cases
>> > > > > only on a single message. Stellar 2.0 could be the mechanism that allows
>> > > > > users to define their own data flows and what "useful bits of Metron
>> > > > > functionality" get plugged-in.
>> > > > >
>> > > > > Once, you have a DSL that allows users to define what they want Metron to
>> > > > > do, then the underlying implementation mechanism (which is currently Storm)
>> > > > > can also be swapped-out. If we have an even faster Storm implementation,
>> > > > > then we swap in the Storm NG engine. Maybe we want Metron to also run in
>> > > > > Flink, then we just swap-in a Flink engine.
>> > > > >
>> > > > > On Thu, Oct 6, 2016 at 12:52 PM, Nick Allen <n...@nickallen.org> wrote:
>> > > > >
>> > > > >> I totally "bird dogged the previous thread" as Casey likes to call it. :)
>> > > > >> I am extracting this thought into a separate thread before I start
>> > > > >> throwing out even more, crazier ideas.
>> > > > >>
>> > > > >>> In general, Metron is very opinionated about data flows right now. We
>> > > > >>> have Parser topologies that feed an Enrichment topology, which then feeds
>> > > > >>> an Indexing topology. We have useful bits of functionality (think Stellar
>> > > > >>> transforms, Geo enrichment, etc) that are closely coupled with these
>> > > > >>> topologies (aka data flows).
>> > > > >>>
>> > > > >>> When a user wants to parse heterogenous data from a single topic, that's
>> > > > >>> not easy. When a user wants enriched output to land in unique topics by
>> > > > >>> sensor type, well, that's also not easy. When a user wanted to skip
>> > > > >>> enrichment of data sources, we actually re-architected the data flow to add
>> > > > >>> the Indexing topology.
>> > > > >>>
>> > > > >>> In an ideal world, a user should be responsible for defining the data
>> > > > >>> flow, not Metron. Metron should provide the "useful bits of functionality"
>> > > > >>> that a user can "plugin" wherever they like. Metron itself should not care
>> > > > >>> how the data is moving or what step in the process it is at.
>> > > > >>
>> > > > >> --
>> > > > >> Nick Allen <n...@nickallen.org>
>> > > > >
>> > > > > --
>> > > > > Nick Allen <n...@nickallen.org>
>> > > >
>> > > > --
>> > > > Nick Allen <n...@nickallen.org>
>> > >
>> > > --
>> > > Jon
>> >
>> > --
>> > Nick Allen <n...@nickallen.org>
>
> --
> Nick Allen <n...@nickallen.org>

--
Nick Allen <n...@nickallen.org>