Simply as a unique identifier of the original information that is failing some step. It gives you something to key in on, lets you build a count of unique events, and lets you prioritize issues without the concern of cyclical issues (if the problem is with indexing a specific message, trying to index it again will just fail in a loop).
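To make the "key in on and count unique events" idea concrete, here is a minimal, hypothetical sketch (plain Java, not Metron code) of tallying indexed error documents by a rawMessage_hash field; in practice this would more likely be a terms aggregation in the dashboard, and the class and field names are assumptions:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ErrorHashCounter {
  // Tally how many error documents share the same rawMessage_hash, i.e. how many
  // times the same original message has failed, regardless of how often it was retried.
  public static Map<String, Integer> countByHash(List<Map<String, Object>> errorDocs) {
    Map<String, Integer> counts = new HashMap<>();
    for (Map<String, Object> doc : errorDocs) {
      Object hash = doc.get("rawMessage_hash");
      if (hash != null) {
        counts.merge(hash.toString(), 1, Integer::sum);
      }
    }
    return counts;
  }
}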
Jon

On Wed, Feb 1, 2017 at 6:59 AM Dima Kovalyov <dima.koval...@sstech.us> wrote:

That's a great topic of discussion.

Throughout the thread, the idea of having a hash of the failed message has changed; can someone please explain why you plan to use this hash, and how?

- Dima

On 02/01/2017 06:23 AM, zeo...@gmail.com wrote:

After thinking on this for a few days I recant my previous suggestion of TupleHash256. It's still a bit early for SHA-3 - no good reference implementations/libraries exist (I did some searching and emailing), it is optimized for hardware but no hardware implementation is widely accessible, FIPS 140-3 is still not close to finalized, etc.

I think we could simulate the benefits of TupleHash by sorting the tuples, then doing SHA-256(len(tuple1) | tuple1 | ... | len(tupleN) | tupleN). Happy to entertain opposing thoughts, such as BLAKE2, etc., but with the likely users of Metron, I think sticking with FIPS 140-2 is a solid choice.

Jon

On Thu, Jan 26, 2017, 11:23 AM zeo...@gmail.com <zeo...@gmail.com> wrote:

So, one more thing regarding why I think we should throw an exception on a failed enrichment. If we do make something like username a constant field, in cases where it is used to calculate rawMessage_hash, the hash would differ depending on whether the enrichment succeeded or failed. Of course, I think the initial intent of adding username as a constant field would be to handle it in the parsers, where that information is provided in the messages themselves, but how would Threat Intel know the difference? In my environment I am looking forward to a streaming enrichment that adds the username, where applicable, anywhere I have an IP.

My hesitant suggestion for a hashing algorithm would be TupleHash256, as it is a NIST-specified SHA-3 construction (using cSHAKE) for exactly this use case. Details here: <http://nvlpubs.nist.gov/nistpubs/specialpublications/nist.sp.800-185.pdf>. However, I haven't been able to find a reference implementation in any language, so that's a bit of a downside. A more general SHA3-256 implementation where we handle the ordering ourselves could work as well, but would be significantly less optimal.

Jon

On Thu, Jan 26, 2017 at 10:20 AM Ryan Merriman <merrim...@gmail.com> wrote:

Jon, I misread the code in the GenericEnrichmentBolt. The error is forwarded on, so no issues there.

Defaulting to the common fields makes sense. I will dig into the GenericEnrichmentBolt more; maybe there is a way to get the error fields without having to significantly change things. Any opinion on a hashing algorithm?

On Wed, Jan 25, 2017 at 9:37 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

Although hashing the whole message is better than nothing, it misses a lot of the benefits we could get.

While I'd love to have consistency for this field across all of the different error.types, it appears that may not be reasonably possible because of the parsers. So, how about something like: hash all of the constant fields <https://github.com/apache/incubator-metron/blob/master/metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java> excluding timestamp and original_string, unless it is a parser, in which case hash the entire message?
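A rough sketch of what that hashing could look like (plain Java; the class and method names are made up for illustration, and the exact field selection and encoding are still open questions):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

public class RawMessageHasher {
  // Sort the selected fields by name, then hash len(value) | value for each field.
  // Whether the caller passes the constant fields or the whole message is decided upstream.
  public static String hash(Map<String, Object> fields) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    for (Map.Entry<String, Object> entry : new TreeMap<>(fields).entrySet()) {
      byte[] value = String.valueOf(entry.getValue()).getBytes(StandardCharsets.UTF_8);
      digest.update(Integer.toString(value.length).getBytes(StandardCharsets.UTF_8));
      digest.update(value);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }
}

Length-prefixing each value keeps adjacent fields from colliding (e.g. "ab" + "c" vs. "a" + "bc"), which is the main property TupleHash would otherwise give us for free.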
This gives us some measure of event uniqueness, and it can grow as we define additional constant fields (I recall a discussion on the list about expanding those standard fields to include things like usernames, but I can't find the specific email exchange).

Because some enrichments can be heavily relied on, I think it makes sense to put a message onto the error queue when an enrichment throws an exception. Not only does this help troubleshoot edge cases, but it makes issues more obvious when assembling a new enrichment in dev/test. I can't think of a current scenario where an enrichment would be only "best effort" and where I wouldn't want that error indexed and retrievable. However, this gets interesting when talking about the various options to solve the "Enrich enrichment" discussion from earlier in the month. We can keep that part separate, though, as I don't think it's being actively pursued right now.

Jon

On Wed, Jan 25, 2017 at 10:49 AM David Lyle <dlyle65...@gmail.com> wrote:

RE: separate JIRA for MPack/Ansible. No objection to tracking them separately, but for this item to be complete, you'll need both the feature and the ability to install it.

-D...

On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <merrim...@gmail.com> wrote:

Assuming we're going to write all errors to a single error topic, I think it makes sense to agree on an error message schema and handle errors across the 3 different topologies in the same way, with a single implementation. The implementation in ParserBolt (ErrorUtils.handleError) produces the most verbose error object, so I think it's a good candidate for the single implementation. Here is the message structure it currently produces:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error"
}

From our discussion so far we need to add a couple of fields: an error type and a hash id. Adding these to the message looks like:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error",
  "error.type": "parser_error",
  "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
}

We should also consider expanding the error types I listed earlier. Instead of just having "indexing_error" we could have "elasticsearch_indexing_error", "hdfs_indexing_error" and so on.

Jon, if an exception happens in an enrichment or threat intel bolt, the message is passed along with no error thrown (only logged). Everywhere else, I'm having trouble identifying specific fields that should be hashed. Would hashing the message in every case be acceptable? Do you know of a place where we could hash a field instead?
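For concreteness, a hedged sketch of how a bolt might assemble the augmented error message above; this is not the actual ErrorUtils.handleError code, the hash algorithm is still undecided on this thread, the class and method names are made up for illustration, and the stack and rawMessage_bytes fields are omitted for brevity:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.json.simple.JSONObject;

public class ErrorMessageBuilder {
  // Build an error message in the shape discussed above, adding the proposed
  // "error.type" and "rawMessage_hash" fields. SHA-256 over the raw message is
  // used here purely as a placeholder for whatever hashing scheme we settle on.
  @SuppressWarnings("unchecked")
  public static JSONObject build(Throwable error, String rawMessage,
                                 String sourceType, String errorType) throws Exception {
    JSONObject message = new JSONObject();
    message.put("exception", error.toString());
    message.put("hostname", java.net.InetAddress.getLocalHost().getHostName());
    message.put("time", System.currentTimeMillis());
    message.put("message", error.getMessage());
    message.put("rawMessage", rawMessage);
    message.put("source.type", sourceType);
    message.put("error.type", errorType);
    message.put("rawMessage_hash", hex(MessageDigest.getInstance("SHA-256")
        .digest(rawMessage.getBytes(StandardCharsets.UTF_8))));
    return message;
  }

  private static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }
}

Whether the hash covers the whole raw message or just the constant fields would follow whatever we settle on above.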
On the topic of exceptions in enrichments, are we OK with an error only being logged, and not added to the message or emitted to the error queue?

On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <merrim...@gmail.com> wrote:

That use case makes sense to me. I don't think it will require that much additional effort either.

On Tue, Jan 24, 2017 at 1:02 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

Regarding error vs. validation - either way I'm not very concerned. I initially assumed they would be combined, and I agree with that approach, but splitting them out isn't a very big deal to me either.

Re: Ryan. Yes, exactly. In the case of a parser issue (or anywhere else where it's not possible to pick out the exact thing causing the issue) it would be a hash of the complete message.

Regarding the architecture, I mostly agree with James, except that I think step 3 also needs to be able to somehow group errors via the original data (identify replays, identify repeat issues with data in a specific field, issues with consistently different data, etc.). This is essentially the first step of troubleshooting, which I assume you are doing if you're looking at the error dashboard.

If the hash gets moved out of the initial implementation, I'm fairly certain you lose this ability. The point here isn't to handle long fields (although that's a benefit of this approach); it's to attach a unique identifier to the error/validation message that links it to the original problem. I'd be happy to consider alternative solutions to this problem (for instance, actually sending across the data itself); I just haven't been able to think of another way to do this that I like better.

Jon

On Tue, Jan 24, 2017 at 1:13 PM Ryan Merriman <merrim...@gmail.com> wrote:

We also need a JIRA for any install/Ansible/MPack work needed.

On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <jsir...@apache.org> wrote:

Now that I have had some time to think about it, I would collapse all error and validation topics into one. We can differentiate between different views of the data (split by error source, etc.) via Kibana dashboards. I would implement this feature incrementally. First, I would modify all the bolts to log to a single topic. Second, I would get the error indexing done by attaching the indexing topology to the error topic. Third, I would create the necessary dashboards to view errors and validation failures by source. Lastly, I would file a follow-on JIRA to introduce hashing of errors or fields that are too long. It seems like a separate feature that we need to think through; we may need a Stellar function around that.

Thanks,
James

24.01.2017, 10:25, "Ryan Merriman" <merrim...@gmail.com>:

I understand what Jon is talking about.
He's proposing we hash the value that caused the error, not necessarily the error message itself. For an enrichment this is easy: just pass along the field value that failed enrichment. For other cases the field that caused the error may not be so obvious. Take parser validation, for example. The message is validated as a whole, and it may not be easy to determine which field is the cause. In that case would a hash of the whole message work?

There is a broader architectural discussion that needs to happen before we can implement this. Currently we have an indexing topology that reads from 1 topic and writes messages to ES, but errors are written to several different topics:

- parser_error
- parser_invalid
- enrichments_error
- threatintel_error
- indexing_error

I can see 4 possible approaches to implementing this:

1. Create an indexing topology for each error topic
   1. Good because we can easily reuse the indexing topology and it would require the least development effort
   2. Bad because it would consume a lot of extra worker slots
2. Move the topic name into the error JSON message as a new "error_type" field and write all messages to the indexing topic
   1. Good because we don't need to create a new topology
   2. Bad because we would be flowing data and errors through the same topology. A spike in errors could affect message indexing.
3. Compromise between 1 and 2. Create another indexing topology that is dedicated to indexing errors. Move the topic name into the error JSON message as a new "error_type" field and write all errors to a single error topic.
4. Write a completely new topology with multiple spouts (1 for each error type listed above) that all feed into a single BulkMessageWriterBolt.
   1. Good because the current topologies would not need to change
   2. Bad because it would require the most development effort, would not reuse existing topologies, and takes up more worker slots than 3

Are there other approaches I haven't thought of? I think 1 and 2 are off the table because they are shortcuts and not good long-term solutions. 3 would be my choice because it introduces less complexity than 4. Thoughts?

Ryan

On Mon, Jan 23, 2017 at 5:44 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

In that case the hash would be of the value in the IP field, such as sha3(8.8.8.8).

Jon
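As an illustration of hashing a single field value (such as the IP above) rather than the whole message, a small sketch in plain Java; note this assumes a JDK that ships SHA3-256 (Java 9+) or an added provider such as Bouncy Castle, and it is not existing Metron code:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class FieldHasher {
  // Hash a single field value, e.g. the IP that failed enrichment.
  // "SHA3-256" is available in the JDK from Java 9 onward; earlier JDKs would
  // need a provider such as Bouncy Castle, or could fall back to "SHA-256".
  public static String sha3(String fieldValue) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA3-256")
        .digest(fieldValue.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(sha3("8.8.8.8")); // the sha3(8.8.8.8) example from above
  }
}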
On Mon, Jan 23, 2017, 6:41 PM James Sirota <jsir...@apache.org> wrote:

Jon,

I am still not entirely following why we would want to use hashing. For example, if my error is "Your IP field is invalid and failed validation", hashing this error string will always result in the same hash. Why not just use the actual error string? Can you provide an example where you would use it?

Thanks,
James

23.01.2017, 16:29, "zeo...@gmail.com" <zeo...@gmail.com>:

For 1 - I'm good with that.

I'm talking about hashing the relevant content itself, not the error. Some benefits are: (1) it minimizes load on the search index (there's minimal benefit in spending the CPU and disk to keep it at full fidelity (tokenize and store)); (2) it provides something to key on for dashboards (assuming a good hash algorithm that avoids collisions and is second-preimage resistant); and (3) specific to errors, if the issue is that the message failed to index, a hash gives us some protection that the issue will not occur twice.

Jon

On Mon, Jan 23, 2017, 2:47 PM James Sirota <jsir...@apache.org> wrote:

Jon,

With regards to 1, collapsing to a single dashboard for each would be fine. So we would have one error index and one "failed to validate" index. The distinction is that errors would be things that went wrong during stream processing (failed to parse, etc.), while validation failures are messages that explicitly failed Stellar validation/schema enforcement. There should be relatively few of the second type.

With respect to 3, why do you want the error hashed? Why not just search for the error text?

Thanks,
James

20.01.2017, 14:01, "zeo...@gmail.com" <zeo...@gmail.com>:

As someone who currently fills the platform engineer role, I can give this idea a huge +1. My thoughts:

1. I think it depends on exactly what data is pushed into the index (#3). However, assuming the errors you proposed recording, I can't see huge benefits to having more than one dashboard. I would be happy to be persuaded otherwise.

2. I would say yes, storing the errors in HDFS in addition to indexing them is a good thing.
Using METRON-510 <https://issues.apache.org/jira/browse/METRON-510> as a case study, there is the potential in this environment for attacker-controlled data to result in processing errors, which could be a method of evading security monitoring. Once an attack is identified, the long-term HDFS storage would allow better historical analysis for low-and-slow/persistent attacks (I'm thinking of a method of data exfil that also won't successfully get stored in Lucene, but is hard to identify over a short period of time).
- Along this line, I think there are various parts of Metron (this included) which could benefit from having a method of configuring data aging by bucket in HDFS (following Nick's comments here <https://issues.apache.org/jira/browse/METRON-477>).

3. I would potentially add a hash of the content that failed validation to help identify repeats over time, with less of a concern that you'd have back-to-back failures (i.e. instead of storing the value itself). Additionally, I think it's helpful to be able to search all times there was an indexing error (instead of it hitting the catch-all).

Jon

On Fri, Jan 20, 2017 at 1:17 PM James Sirota <jsir...@apache.org> wrote:

We already have a capability to capture bolt errors and validation errors and pipe them into a Kafka topic. I want to propose that we attach a writer topology to the error and validation-failed Kafka topics so that we can (a) create a new ES index for these errors and (b) create a new Kibana dashboard to visualize them. The benefit would be that errors and validation failures would be easier to see and analyze.

I am seeking feedback on the following:

- How granular would we want this feature to be? Do we think we would want one index/dashboard per source? Or would it be better to collapse everything into the same index?
- Do we care about storing these errors in HDFS as well? Or is indexing them enough?
- What types of errors should we record? I am proposing:
I am proposing: > >>>>>>>>> > >> > >>>>>>>>> > >> For error reporting: > >>>>>>>>> > >> --Message failed to parse > >>>>>>>>> > >> --Enrichment failed to enrich > >>>>>>>>> > >> --Threat intel feed failures > >>>>>>>>> > >> --Generic catch-all for all other errors > >>>>>>>>> > >> > >>>>>>>>> > >> For validation reporting: > >>>>>>>>> > >> --What part of message failed validation > >>>>>>>>> > >> --What stellar validator caused the failure > >>>>>>>>> > >> > >>>>>>>>> > >> ------------------- > >>>>>>>>> > >> Thank you, > >>>>>>>>> > >> > >>>>>>>>> > >> James Sirota > >>>>>>>>> > >> PPMC- Apache Metron (Incubating) > >>>>>>>>> > >> jsirota AT apache DOT org > >>>>>>>>> > >> > >>>>>>>>> > >> -- > >>>>>>>>> > >> > >>>>>>>>> > >> Jon > >>>>>>>>> > >> > >>>>>>>>> > >> Sent from my mobile device > >>>>>>>>> > > > >>>>>>>>> > > ------------------- > >>>>>>>>> > > Thank you, > >>>>>>>>> > > > >>>>>>>>> > > James Sirota > >>>>>>>>> > > PPMC- Apache Metron (Incubating) > >>>>>>>>> > > jsirota AT apache DOT org > >>>>>>>>> > > > >>>>>>>>> > > -- > >>>>>>>>> > > > >>>>>>>>> > > Jon > >>>>>>>>> > > > >>>>>>>>> > > Sent from my mobile device > >>>>>>>>> > > >>>>>>>>> > ------------------- > >>>>>>>>> > Thank you, > >>>>>>>>> > > >>>>>>>>> > James Sirota > >>>>>>>>> > PPMC- Apache Metron (Incubating) > >>>>>>>>> > jsirota AT apache DOT org > >>>>>>>>> > > >>>>>>>>> -- > >>>>>>>>> > >>>>>>>>> Jon > >>>>>>>>> > >>>>>>>>> Sent from my mobile device > >>>>>>> ------------------- > >>>>>>> Thank you, > >>>>>>> > >>>>>>> James Sirota > >>>>>>> PPMC- Apache Metron (Incubating) > >>>>>>> jsirota AT apache DOT org > >>>>>>> > >>>>> -- > >>>>> > >>>>> Jon > >>>>> > >>>>> Sent from my mobile device > >>>>> > >>>> > >> -- > >> > >> Jon > >> > >> Sent from my mobile device > >> > > -- Jon Sent from my mobile device