FYI, the PR for this is up at https://github.com/apache/metron/pull/940. For those interested, please comment on the actual implementation there.
On Thu, Feb 22, 2018 at 12:43 PM, Casey Stella <ceste...@gmail.com> wrote:

> So, these are good questions, as usual Otto :)
>
> > how does this affect the distribution of work through the cluster, and
> > resiliency of the topologies?
>
> This moves us to a data parallelism scheme rather than a task parallelism
> scheme. In effect, this means that we will not be distributing the
> partial enrichments across the network for a given message, but rather
> distributing the messages across the network for *full* enrichment. So,
> the bundle of work is the same, but we're not concentrating capabilities
> in specific workers. Then again, as soon as we moved to Stellar
> enrichments and sub-groups, where you can interact with HBase or geo from
> within Stellar, we sort of abandoned specialization. Resiliency
> shouldn't be affected and, indeed, it should be easier to reason about.
> We ack after every bolt in the new scheme rather than avoid acking until
> we join and ack the original tuple. In fact, I'm still not convinced
> there isn't a bug somewhere in that join bolt that makes it so we don't
> ack the right tuple.
>
> > Is anyone else doing it like this?
>
> The Storm-y way of doing this is, no doubt, to specialize in the bolts
> and join in a fan-out/fan-in pattern. I do not think it's unheard of,
> though, to use a threadpool. It's slightly peculiar inasmuch as Storm
> has its own threading model, but enrichment is an embarrassingly
> parallel task, and the main shift is changing the unit of parallelism
> from enrichment task to message, with the gain of fewer network hops.
> That being said, as long as you're not emitting from a different thread
> than the one you are receiving from, there's no technical limitation.
>
> > Can we have multiple thread pools and group tasks together (or
> > separate them) wrt HBase?
>
> We could, but I think we might consider starting with just a simple
> static threadpool that we configure at the topology level (e.g. multiple
> worker threads can share the same threadpool). I think as the trend of
> moving everything to Stellar continues, we may end up in a situation
> where we don't have a coherent or clear way to differentiate between
> thread pools like we do now.
>
> > Also, how are we to measure the effect?
>
> Well, some of the benefits here are at an architectural/feature level,
> the most exciting of which is that this approach opens up avenues for
> Stellar subgroups to depend on each other. Slightly less exciting, but
> still nice, is the fact that this normalizes us with *other* streaming
> technologies, and the decoupling work done as part of the PR (soon to be
> released) will make it easy to transition if we so desire. Beyond that,
> for performance, someone will have to run some performance tests or try
> it out in a situation where they're having some enrichment performance
> issues. Until we do that, I think we should probably just keep it as a
> parallel approach that you can swap in if you so desire.
>
> On Thu, Feb 22, 2018 at 11:48 AM, Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
>> This sounds worth exploring. A couple of questions:
>>
>> * How does this affect the distribution of work through the cluster,
>>   and the resiliency of the topologies?
>> * Is anyone else doing it like this?
>> * Can we have multiple thread pools and group tasks together (or
>>   separate them) wrt HBase?
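To make the threading constraint above concrete, here is a minimal sketch
of an enrichment bolt along the lines Casey describes: a static threadpool
shared by every executor in the same worker JVM, fan-out of a single
message's enrichments onto that pool, and the join, emit, and ack all
happening on the receiving thread. All class, interface, and field names
here are illustrative, not the code in the PR:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class UnifiedEnrichmentBolt extends BaseRichBolt {

      // One enrichment of a message: the concrete enrichments (geo, HBase,
      // Stellar subgroups, ...) would implement this.
      public interface Enrichment {
        Map<String, Object> enrich(Map<String, Object> message);
      }

      // Static, so it is shared by every executor in this worker JVM and
      // is not serialized with the bolt. The hard-coded size stands in for
      // the topology-level configuration mentioned above.
      private static final ExecutorService POOL = Executors.newFixedThreadPool(10);

      private OutputCollector collector;
      private List<Enrichment> enrichments;

      @Override
      public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.enrichments = new ArrayList<>(); // in reality, built from the sensor config
      }

      @Override
      public void execute(Tuple tuple) {
        @SuppressWarnings("unchecked")
        Map<String, Object> message = (Map<String, Object>) tuple.getValueByField("message");
        try {
          // Fan the enrichments for this one message out across the pool...
          List<Future<Map<String, Object>>> futures = new ArrayList<>();
          for (Enrichment e : enrichments) {
            futures.add(POOL.submit(() -> e.enrich(message)));
          }
          // ...then join, emit, and ack on the receiving thread, since the
          // OutputCollector must not be used from the pool's threads.
          Map<String, Object> enriched = new HashMap<>(message);
          for (Future<Map<String, Object>> f : futures) {
            enriched.putAll(f.get());
          }
          collector.emit(tuple, new Values(enriched));
          collector.ack(tuple);
        } catch (Exception ex) {
          collector.reportError(ex);
          collector.fail(tuple);
        }
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
      }
    }

Note how acking happens per bolt, immediately after the in-process join,
which is what makes the failure behavior easier to reason about than the
join bolt's cache.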
>>
>> On February 22, 2018 at 11:32:39, Casey Stella (ceste...@gmail.com)
>> wrote:
>>
>> Hi all,
>>
>> I've been thinking about and working on something that I wanted to get
>> some feedback on. Regarding the way that we do our enrichments: the
>> split/join architecture was created to effectively parallelize
>> enrichments in a Storm-like way, in contrast to OpenSOC.
>>
>> There are some good parts to this architecture:
>>
>> - It works; enrichments are done in parallel
>> - You can tune individual enrichments differently
>> - It's very Storm-like
>>
>> There are also some deficiencies:
>>
>> - It's hard to reason about
>>   - Understanding the latency of enriching a message requires looking
>>     at multiple bolts that each give summary statistics
>> - The join bolt's cache is really hard to reason about when
>>   performance tuning
>>   - During spikes in traffic, you can overload the join bolt's cache
>>     and drop messages if you aren't careful
>>   - In general, it's hard to associate a cache size and a duration
>>     kept in cache with throughput and latency
>> - There are a lot of network hops per message
>> - Right now we are stuck at 2 stages of transformations being done
>>   (enrichment and threat intel). It's very possible that you might
>>   want Stellar enrichments to depend on the output of other Stellar
>>   enrichments. In order to implement this in split/join, you'd have to
>>   create a cycle in the Storm topology
>>
>> I propose a change. I propose that we move to a model where we do
>> enrichments in a single bolt in parallel using a static threadpool
>> (e.g. multiple workers in the same process would share the threadpool).
>> In all other ways, this would be backwards compatible: a transparent
>> drop-in replacement for the existing enrichment topology.
>>
>> There are some pros/cons about this too:
>>
>> - Pro
>>   - Easier to reason about from an individual message perspective
>>   - Architecturally decoupled from Storm
>>     - This sets us up if we want to consider other streaming
>>       technologies
>>   - Fewer bolts
>>     - spout -> enrichment bolt -> threat intel bolt -> output bolt
>>   - Way fewer network hops per message
>>     - Currently 2n+1, where n is the number of enrichments used (if
>>       using Stellar subgroups, each subgroup is a hop)
>>   - Easier to reason about from a performance perspective
>>     - We trade cache size and eviction timeout for threadpool size
>>   - We set ourselves up to have Stellar subgroups with dependencies
>>     - i.e. Stellar subgroups that depend on the output of other
>>       subgroups
>>     - If we do this, we can shrink the topology to just spout ->
>>       enrichment/threat intel -> output
>> - Con
>>   - We can no longer tune Stellar enrichments independently from HBase
>>     enrichments
>>     - To be fair, with enrichments moving to Stellar, this is the case
>>       in the split/join approach too
>>   - No idea about performance
>>
>> What I propose is to submit a PR that will deliver an alternative,
>> completely backwards compatible topology for enrichment that you can
>> use by adjusting the start_enrichment_topology.sh script to use
>> remote-unified.yaml instead of remote.yaml. If we live with it for a
>> while and have some good experiences with it, maybe we can consider
>> retiring the old enrichment topology.
>>
>> Thoughts? Keep me honest; if I have over- or understated the issues
>> for split/join or missed some important architectural issue, let me
>> know. I'm going to submit a PR to this effect by the EOD today so
>> things will be more obvious.
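To illustrate the subgroup-dependency point: once all enrichment runs on
one threadpool in one process, letting one Stellar subgroup consume the
output of another is just ordinary future composition, with no cycle in
the topology required. The subgroups below (geo, risk) and their field
names are invented for illustration; they are not Metron's actual
enrichments:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SubgroupDependencySketch {

      // A "subgroup" here is just a function from the message so far to
      // the fields it adds.
      static Map<String, Object> geoSubgroup(Map<String, Object> msg) {
        Map<String, Object> out = new HashMap<>();
        out.put("geo.city", "lookup for " + msg.get("ip_src_addr"));
        return out;
      }

      static Map<String, Object> riskSubgroup(Map<String, Object> msg) {
        // Depends on the *output* of the geo subgroup, which split/join
        // cannot express without a cycle in the Storm topology.
        Map<String, Object> out = new HashMap<>();
        out.put("risk.flag", msg.get("geo.city") != null);
        return out;
      }

      static Map<String, Object> merged(Map<String, Object> base,
                                        Map<String, Object> add) {
        Map<String, Object> out = new HashMap<>(base);
        out.putAll(add);
        return out;
      }

      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        Map<String, Object> message = new HashMap<>();
        message.put("ip_src_addr", "10.0.0.1");

        // Stage the dependent subgroup after the one it needs; independent
        // subgroups could still be fanned out and joined as usual.
        CompletableFuture<Map<String, Object>> enriched =
            CompletableFuture
                .supplyAsync(() -> merged(message, geoSubgroup(message)), pool)
                .thenApplyAsync(m -> merged(m, riskSubgroup(m)), pool);

        System.out.println(enriched.get());
        pool.shutdown();
      }
    }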