We also need a JIRA for any install/Ansible/MPack work needed.

On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <jsir...@apache.org> wrote:
> Now that I have had some time to think about it, I would collapse all
> error and validation topics into one. We can differentiate between
> different views of the data (split by error source, etc.) via Kibana
> dashboards. I would implement this feature incrementally. First, I
> would modify all the bolts to log to a single topic. Second, I would
> get the error indexing done by attaching the indexing topology to the
> error topic. Third, I would create the necessary dashboards to view
> errors and validation failures by source. Lastly, I would file a
> follow-on JIRA to introduce hashing of errors or fields that are too
> long. It seems like a separate feature that we need to think through.
> We may need a Stellar function around that.
>
> Thanks,
> James
>
> 24.01.2017, 10:25, "Ryan Merriman" <merrim...@gmail.com>:
> > I understand what Jon is talking about. He's proposing we hash the
> > value that caused the error, not necessarily the error message
> > itself. For an enrichment this is easy: just pass along the field
> > value that failed enrichment. For other cases the field that caused
> > the error may not be so obvious. Take parser validation, for
> > example. The message is validated as a whole, and it may not be easy
> > to determine which field is the cause. In that case, would a hash of
> > the whole message work?
> >
> > There is a broader architectural discussion that needs to happen
> > before we can implement this. Currently we have an indexing topology
> > that reads from one topic and writes messages to ES, but errors are
> > written to several different topics:
> >
> > - parser_error
> > - parser_invalid
> > - enrichments_error
> > - threatintel_error
> > - indexing_error
> >
> > I can see 4 possible approaches to implementing this:
> >
> > 1. Create an indexing topology for each error topic
> >    1. Good because we can easily reuse the indexing topology, and it
> >       would require the least development effort
> >    2. Bad because it would consume a lot of extra worker slots
> > 2. Move the topic name into the error JSON message as a new
> >    "error_type" field and write all messages to the indexing topic
> >    1. Good because we don't need to create a new topology
> >    2. Bad because we would be flowing data and errors through the
> >       same topology; a spike in errors could affect message indexing
> > 3. A compromise between 1 and 2: create another indexing topology
> >    that is dedicated to indexing errors, move the topic name into
> >    the error JSON message as a new "error_type" field, and write all
> >    errors to a single error topic
> > 4. Write a completely new topology with multiple spouts (one for
> >    each error type listed above) that all feed into a single
> >    BulkMessageWriterBolt
> >    1. Good because the current topologies would not need to change
> >    2. Bad because it would require the most development effort,
> >       would not reuse existing topologies, and takes up more worker
> >       slots than 3
> >
> > Are there other approaches I haven't thought of? I think 1 and 2 are
> > off the table because they are shortcuts and not good long-term
> > solutions. 3 would be my choice because it introduces less
> > complexity than 4. Thoughts?
> >
> > Ryan
> >
> > On Mon, Jan 23, 2017 at 5:44 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:
> >
> >> In that case the hash would be of the value in the IP field, such
> >> as sha3(8.8.8.8).
> >>
> >> Jon
> >>
> >> On Mon, Jan 23, 2017, 6:41 PM James Sirota <jsir...@apache.org> wrote:
> >>
> >> > Jon,
> >> >
> >> > I am still not entirely following why we would want to use
> >> > hashing. For example, if my error is "Your IP field is invalid
> >> > and failed validation", hashing this error string will always
> >> > result in the same hash. Why not just use the actual error
> >> > string? Can you provide an example where you would use it?
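Jon's sha3(8.8.8.8) example can be made concrete with a short Python sketch. This only illustrates the idea under discussion; the choice of SHA3-256 and the function name are assumptions, not anything the thread settled on.

```python
import hashlib

# Illustration of hashing the offending value rather than the error
# string. SHA3-256 is one second-preimage-resistant choice; the thread
# does not fix a specific algorithm.
def hash_error_value(value: str) -> str:
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()

# Enrichment failure: the offending field is known, so hash just it.
ip_hash = hash_error_value("8.8.8.8")

# Parser failure: no single field may be identifiable, so per Ryan's
# question one could hash the raw message as a whole.
raw_hash = hash_error_value('{"ip_src_addr": "not-an-ip"}')
```

Identical bad values then collapse to the same key for dashboards, while the index never has to store the raw content at full fidelity.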
> >> > Thanks,
> >> > James
> >> >
> >> > 23.01.2017, 16:29, "zeo...@gmail.com" <zeo...@gmail.com>:
> >> > > For 1 - I'm good with that.
> >> > >
> >> > > I'm talking about hashing the relevant content itself, not the
> >> > > error. Some benefits are: (1) it minimizes load on the search
> >> > > index (there's minimal benefit in spending the CPU and disk to
> >> > > keep it at full fidelity (tokenize and store)); (2) it provides
> >> > > something to key on for dashboards (assuming a good hash
> >> > > algorithm that avoids collisions and is second-preimage
> >> > > resistant); and (3) specific to errors, if the issue is that it
> >> > > failed to index, a hash gives us some protection that the issue
> >> > > will not occur twice.
> >> > >
> >> > > Jon
> >> > >
> >> > > On Mon, Jan 23, 2017, 2:47 PM James Sirota <jsir...@apache.org> wrote:
> >> > >
> >> > > Jon,
> >> > >
> >> > > With regards to 1, collapsing to a single dashboard for each
> >> > > would be fine. So we would have one error index and one "failed
> >> > > to validate" index. The distinction is that errors would be
> >> > > things that went wrong during stream processing (failed to
> >> > > parse, etc.), while validation failures are messages that
> >> > > explicitly failed Stellar validation/schema enforcement. There
> >> > > should be relatively few of the second type.
> >> > >
> >> > > With respect to 3, why do you want the error hashed? Why not
> >> > > just search for the error text?
> >> > >
> >> > > Thanks,
> >> > > James
> >> > >
> >> > > 20.01.2017, 14:01, "zeo...@gmail.com" <zeo...@gmail.com>:
> >> > >> As someone who currently fills the platform engineer role, I
> >> > >> can give this idea a huge +1. My thoughts:
> >> > >>
> >> > >> 1. I think it depends on exactly what data is pushed into the
> >> > >> index (#3). However, assuming the errors you proposed
> >> > >> recording, I can't see huge benefits to having more than one
> >> > >> dashboard.
> >> > >> I would be happy to be persuaded otherwise.
> >> > >>
> >> > >> 2. I would say yes, storing the errors in HDFS in addition to
> >> > >> indexing them is a good thing. Using METRON-510
> >> > >> <https://issues.apache.org/jira/browse/METRON-510> as a case
> >> > >> study, there is the potential in this environment for
> >> > >> attacker-controlled data to result in processing errors, which
> >> > >> could be a method of evading security monitoring. Once an
> >> > >> attack is identified, the long-term HDFS storage would allow
> >> > >> better historical analysis of low-and-slow/persistent attacks
> >> > >> (I'm thinking of a method of data exfil that also won't
> >> > >> successfully get stored in Lucene, but is hard to identify
> >> > >> over a short period of time).
> >> > >> - Along this line, I think there are various parts of Metron
> >> > >> (this included) which could benefit from having a method of
> >> > >> configuring data aging by bucket in HDFS (following Nick's
> >> > >> comments here <https://issues.apache.org/jira/browse/METRON-477>).
> >> > >>
> >> > >> 3. I would potentially add a hash of the content that failed
> >> > >> validation to help identify repeats over time with less of a
> >> > >> concern that you'd have back-to-back failures (i.e. instead of
> >> > >> storing the value itself). Additionally, I think it's helpful
> >> > >> to be able to search all times there was an indexing error
> >> > >> (instead of it hitting the catch-all).
> >> > >>
> >> > >> Jon
> >> > >>
> >> > >> On Fri, Jan 20, 2017 at 1:17 PM James Sirota <jsir...@apache.org> wrote:
> >> > >>
> >> > >> We already have a capability to capture bolt errors and
> >> > >> validation errors and pipe them into a Kafka topic. I want to
> >> > >> propose that we attach a writer topology to the error and
> >> > >> validation-failed Kafka topics so that we can (a) create a new
> >> > >> ES index for these errors and (b) create a new Kibana
> >> > >> dashboard to visualize them. The benefit would be that errors
> >> > >> and validation failures would be easier to see and analyze.
> >> > >>
> >> > >> I am seeking feedback on the following:
> >> > >>
> >> > >> - How granular would we want this feature to be? Do we think
> >> > >> we would want one index/dashboard per source? Or would it be
> >> > >> better to collapse everything into the same index?
> >> > >> - Do we care about storing these errors in HDFS as well? Or is
> >> > >> indexing them enough?
> >> > >> - What types of errors should we record? I am proposing:
> >> > >>
> >> > >> For error reporting:
> >> > >> --Message failed to parse
> >> > >> --Enrichment failed to enrich
> >> > >> --Threat intel feed failures
> >> > >> --Generic catch-all for all other errors
> >> > >>
> >> > >> For validation reporting:
> >> > >> --What part of the message failed validation
> >> > >> --What Stellar validator caused the failure
> >> > >>
> >> > >> -------------------
> >> > >> Thank you,
> >> > >>
> >> > >> James Sirota
> >> > >> PPMC- Apache Metron (Incubating)
> >> > >> jsirota AT apache DOT org
> >> > >>
> >> > >> --
> >> > >> Jon
> >> > >>
> >> > >> Sent from my mobile device
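Taken together, Ryan's approach 3 plus the hashing follow-on James describes would yield a single unified error record. A rough sketch follows; every field name here is a hypothetical placeholder for illustration, not Metron's actual schema.

```python
import hashlib
import json
import time

# Hypothetical unified error record for a single error topic.
# "error_type" carries the name of the topic the error previously went
# to; "raw_message_hash" replaces the raw content with its digest.
def build_error_record(raw_message: str, error_type: str, source_type: str) -> str:
    record = {
        "error_type": error_type,    # e.g. "parser_error"
        "source.type": source_type,  # sensor that produced the message
        "raw_message_hash": hashlib.sha3_256(
            raw_message.encode("utf-8")).hexdigest(),
        "timestamp": int(time.time() * 1000),  # event time in ms
    }
    return json.dumps(record)

record = build_error_record("unparseable payload", "parser_error", "bro")
```

A dedicated indexing topology attached to that one topic could then drive per-source Kibana dashboards by splitting on "error_type", as discussed above.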