RE: separate JIRA for MPack/Ansible. No objection to tracking them separately, but for this item to be complete, you'll need both the feature and the ability to install it.
-D...

On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <merrim...@gmail.com> wrote:

Assuming we're going to write all errors to a single error topic, I think it makes sense to agree on an error message schema and handle errors across the 3 different topologies in the same way, with a single implementation. The implementation in ParserBolt (ErrorUtils.handleError) produces the most verbose error object, so I think it's a good candidate for the single implementation. Here is the message structure it currently produces:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error"
}

From our discussion so far we need to add a couple of fields: an error type and a hash id. Adding these to the message looks like:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error",
  "error.type": "parser_error",
  "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
}

We should also consider expanding the error types I listed earlier. Instead of just having "indexing_error" we could have "elasticsearch_indexing_error", "hdfs_indexing_error", and so on.

Jon, if an exception happens in an enrichment or threat intel bolt, the message is passed along with no error thrown (only logged). Everywhere else I'm having trouble identifying specific fields that should be hashed. Would hashing the message in every case be acceptable? Do you know of a place where we could hash a field instead? On the topic of exceptions in enrichments, are we OK with an error only being logged and not added to the message or emitted to the error queue?
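For illustration, here is a minimal sketch (in Python, though Metron's ErrorUtils is Java) of producing the proposed schema with the two new fields. The MD5 choice is an assumption inferred from the 32-hex-character example hash above; the helper and its `hostname` default are hypothetical, not existing Metron code:

```python
import hashlib
import json
import time

def build_error_message(raw_message, error_text, source_type, error_type, hostname="host"):
    """Sketch of the unified error schema proposed above (field names from the thread)."""
    return {
        "exception": "java.lang.Exception: " + error_text,
        "hostname": hostname,                    # in Metron this would be the worker host
        "time": int(time.time() * 1000),         # epoch millis, as in the example
        "message": error_text,
        "rawMessage": raw_message,
        "source.type": source_type,
        "error.type": error_type,                # proposed new field
        # proposed new field: a hash of the raw message (MD5 is an assumption),
        # so repeated failures of the same input can be grouped without
        # storing the input at full fidelity
        "rawMessage_hash": hashlib.md5(raw_message.encode("utf-8")).hexdigest(),
    }

print(json.dumps(build_error_message(
    "raw message", "there was an error", "bro_error", "parser_error"), indent=2))
```

Because every bolt would call the same helper, the three topologies emit structurally identical errors, which is what makes a single index and dashboard workable downstream.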
On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <merrim...@gmail.com> wrote:

That use case makes sense to me. I don't think it will require that much additional effort either.

On Tue, Jan 24, 2017 at 1:02 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

Regarding error vs. validation: either way I'm not very concerned. I initially assumed they would be combined and agree with that approach, but splitting them out isn't a very big deal to me either.

Re: Ryan. Yes, exactly. In the case of a parser issue (or anywhere else where it's not possible to pick out the exact thing causing the issue), it would be a hash of the complete message.

Regarding the architecture, I mostly agree with James, except that I think step 3 also needs to be able to somehow group errors via the original data (identify replays, identify repeat issues with data in a specific field, issues with consistently different data, etc.). This is essentially the first step of troubleshooting, which I assume you are doing if you're looking at the error dashboard.

If the hash gets moved out of the initial implementation, I'm fairly certain you lose this ability. The point here isn't to handle long fields (although that's a benefit of this approach); it's to attach a unique identifier to the error/validation issue message that links it to the original problem. I'd be happy to consider alternative solutions to this problem (for instance, actually sending across the data itself); I just haven't been able to think of another way to do this that I like better.

Jon

On Tue, Jan 24, 2017 at 1:13 PM, Ryan Merriman <merrim...@gmail.com> wrote:

We also need a JIRA for any install/Ansible/MPack work needed.
On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <jsir...@apache.org> wrote:

Now that I've had some time to think about it, I would collapse all error and validation topics into one. We can differentiate between different views of the data (split by error source, etc.) via Kibana dashboards. I would implement this feature incrementally. First, I would modify all the bolts to log to a single topic. Second, I would get the error indexing done by attaching the indexing topology to the error topic. Third, I would create the necessary dashboards to view errors and validation failures by source. Lastly, I would file a follow-on JIRA to introduce hashing of errors or fields that are too long. It seems like a separate feature that we need to think through. We may need a Stellar function around that.

Thanks,
James

24.01.2017, 10:25, "Ryan Merriman" <merrim...@gmail.com>:

I understand what Jon is talking about. He's proposing we hash the value that caused the error, not necessarily the error message itself. For an enrichment this is easy: just pass along the field value that failed enrichment. For other cases the field that caused the error may not be so obvious. Take parser validation, for example. The message is validated as a whole, and it may not be easy to determine which field is the cause. In that case, would a hash of the whole message work?

There is a broader architectural discussion that needs to happen before we can implement this.
Currently we have an indexing topology that reads from one topic and writes messages to ES, but errors are written to several different topics:

- parser_error
- parser_invalid
- enrichments_error
- threatintel_error
- indexing_error

I can see 4 possible approaches to implementing this:

1. Create an index topology for each error topic
   - Good because we can easily reuse the indexing topology, and it would require the least development effort
   - Bad because it would consume a lot of extra worker slots
2. Move the topic name into the error JSON message as a new "error_type" field and write all messages to the indexing topic
   - Good because we don't need to create a new topology
   - Bad because we would be flowing data and errors through the same topology; a spike in errors could affect message indexing
3. A compromise between 1 and 2: create another indexing topology that is dedicated to indexing errors, move the topic name into the error JSON message as a new "error_type" field, and write all errors to a single error topic
4. Write a completely new topology with multiple spouts (one for each error type listed above) that all feed into a single BulkMessageWriterBolt
   - Good because the current topologies would not need to change
   - Bad because it would require the most development effort, would not reuse existing topologies, and takes up more worker slots than 3

Are there other approaches I haven't thought of? I think 1 and 2 are off the table because they are shortcuts and not good long-term solutions.
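Approaches 2 and 3 both hinge on collapsing the five error topics into one and letting the origin travel inside the message. A minimal sketch of that producer-side tagging, assuming the topic and helper names (both hypothetical, not existing Metron code):

```python
# Hypothetical sketch of the "error_type" tagging in approaches 2 and 3:
# the five existing error topics collapse into one shared topic, and the
# origin becomes a queryable field on the message itself.
ERROR_TOPIC = "error"  # assumed name for the single shared error topic

LEGACY_TOPICS = {
    "parser_error",
    "parser_invalid",
    "enrichments_error",
    "threatintel_error",
    "indexing_error",
}

def route_error(error_json, legacy_topic):
    """Return (topic, message) for the single-topic scheme."""
    if legacy_topic not in LEGACY_TOPICS:
        raise ValueError("unknown error source: " + legacy_topic)
    tagged = dict(error_json)                # don't mutate the caller's message
    tagged["error_type"] = legacy_topic      # origin becomes a queryable field
    return ERROR_TOPIC, tagged

topic, msg = route_error({"message": "boom"}, "parser_error")
# every error now lands on one topic; dashboards filter on error_type
```

The payoff is that Kibana views split by error source need no topology changes at all, just a filter on the new field.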
3 would be my choice because it introduces less complexity than 4. Thoughts?

Ryan

On Mon, Jan 23, 2017 at 5:44 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

In that case the hash would be of the value in the IP field, such as sha3(8.8.8.8).

Jon

On Mon, Jan 23, 2017, 6:41 PM, James Sirota <jsir...@apache.org> wrote:

Jon,

I am still not entirely following why we would want to use hashing. For example, if my error is "Your IP field is invalid and failed validation", hashing this error string will always result in the same hash. Why not just use the actual error string? Can you provide an example where you would use it?

Thanks,
James

23.01.2017, 16:29, "zeo...@gmail.com" <zeo...@gmail.com>:

For 1, I'm good with that.

I'm talking about hashing the relevant content itself, not the error. Some benefits are: (1) it minimizes load on the search index (there's minimal benefit in spending the CPU and disk to keep the content at full fidelity, i.e. tokenized and stored); (2) it provides something to key on for dashboards (assuming a good hash algorithm that avoids collisions and is second-preimage resistant); and (3) specific to errors, if the issue is that the content failed to index, a hash gives us some protection that the issue will not occur twice.
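Jon's scheme can be sketched concretely: hash the offending field's value when it can be identified (his sha3(8.8.8.8) example), and fall back to hashing the whole message when it can't, as in a parser failure. SHA3-256 here is an assumption based on his example; the helper is illustrative, not an existing Metron API:

```python
import hashlib
import json

def error_hash(message, failed_field=None):
    """Hash the value that caused the error when it can be identified,
    falling back to the whole message (e.g. for parser failures)."""
    if failed_field is not None and failed_field in message:
        content = str(message[failed_field])      # e.g. "8.8.8.8"
    else:
        # canonicalize so the same message always yields the same hash
        content = json.dumps(message, sort_keys=True)
    return hashlib.sha3_256(content.encode("utf-8")).hexdigest()

# enrichment failure: the offending field is known, so hash just its value
h1 = error_hash({"ip_src_addr": "8.8.8.8", "other": "fields"}, failed_field="ip_src_addr")
# parser failure: no single field to blame, so hash the whole message
h2 = error_hash({"raw": "unparseable line"})
```

Note the canonicalization step: hashing a raw dict-to-string rendering would make the hash depend on field order, which would defeat the grouping-over-time use case Jon describes.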
Jon

On Mon, Jan 23, 2017, 2:47 PM, James Sirota <jsir...@apache.org> wrote:

Jon,

With regard to 1, collapsing to a single dashboard for each would be fine. So we would have one error index and one "failed to validate" index. The distinction is that errors would be things that went wrong during stream processing (failed to parse, etc.), while validation failures are messages that explicitly failed Stellar validation/schema enforcement. There should be relatively few of the second type.

With respect to 3, why do you want the error hashed? Why not just search for the error text?

Thanks,
James

20.01.2017, 14:01, "zeo...@gmail.com" <zeo...@gmail.com>:

As someone who currently fills the platform engineer role, I can give this idea a huge +1. My thoughts:

1. I think it depends on exactly what data is pushed into the index (#3). However, assuming the errors you proposed recording, I can't see huge benefits to having more than one dashboard. I would be happy to be persuaded otherwise.

2. I would say yes, storing the errors in HDFS in addition to indexing is a good thing.
Using METRON-510 <https://issues.apache.org/jira/browse/METRON-510> as a case study, there is the potential in this environment for attacker-controlled data to result in processing errors, which could be a method of evading security monitoring. Once an attack is identified, the long-term HDFS storage would allow better historical analysis of low-and-slow/persistent attacks (I'm thinking of a method of data exfil that also won't successfully get stored in Lucene, but is hard to identify over a short period of time).

- Along this line, I think there are various parts of Metron (this included) which could benefit from a method of configuring data aging by bucket in HDFS (following Nick's comments here <https://issues.apache.org/jira/browse/METRON-477>).

3. I would potentially add a hash of the content that failed validation to help identify repeats over time, with less of a concern that you'd have back-to-back failures (i.e. instead of storing the value itself). Additionally, I think it's helpful to be able to search all the times there was an indexing error (instead of it hitting the catch-all).
Jon

On Fri, Jan 20, 2017 at 1:17 PM, James Sirota <jsir...@apache.org> wrote:

We already have the capability to capture bolt errors and validation errors and pipe them into a Kafka topic. I want to propose that we attach a writer topology to the error and validation-failed Kafka topics so that we can (a) create a new ES index for these errors and (b) create a new Kibana dashboard to visualize them. The benefit would be that errors and validation failures would be easier to see and analyze.

I am seeking feedback on the following:

- How granular would we want this feature to be? Do we think we would want one index/dashboard per source? Or would it be better to collapse everything into the same index?
- Do we care about storing these errors in HDFS as well? Or is indexing them enough?
- What types of errors should we record?
I am proposing:

For error reporting:
--Message failed to parse
--Enrichment failed to enrich
--Threat intel feed failures
--Generic catch-all for all other errors

For validation reporting:
--What part of the message failed validation
--What Stellar validator caused the failure

-------------------
Thank you,

James Sirota
PPMC - Apache Metron (Incubating)
jsirota AT apache DOT org

--
Jon

Sent from my mobile device