As someone who currently fills the platform engineer role, I can give this idea a huge +1. My thoughts:
1. I think it depends on exactly what data is pushed into the index (#3). However, assuming the errors you proposed recording, I can't see huge benefits to having more than one dashboard. I would be happy to be persuaded otherwise.

2. I would say yes, storing the errors in HDFS in addition to indexing them is a good thing. Using METRON-510 <https://issues.apache.org/jira/browse/METRON-510> as a case study, there is the potential in this environment for attacker-controlled data to result in processing errors, which could be a method of evading security monitoring. Once an attack is identified, the long-term HDFS storage would allow better historical analysis of low-and-slow/persistent attacks (I'm thinking of a method of data exfil that also won't successfully get stored in Lucene, but is hard to identify over a short period of time).

- Along this line, I think there are various parts of Metron (this included) that could benefit from a method of configuring data aging by bucket in HDFS (following Nick's comments here <https://issues.apache.org/jira/browse/METRON-477>).

3. I would potentially add a hash of the content that failed validation (instead of storing the value itself) to help identify repeats over time, with less concern about needing back-to-back failures. Additionally, I think it's helpful to be able to search all the times there was an indexing error (instead of it hitting the catch-all).

Jon

On Fri, Jan 20, 2017 at 1:17 PM James Sirota <jsir...@apache.org> wrote:

We already have a capability to capture bolt errors and validation errors and pipe them into a Kafka topic. I want to propose that we attach a writer topology to the error and validation-failed Kafka topics so that we can (a) create a new ES index for these errors and (b) create a new Kibana dashboard to visualize them. The benefit would be that errors and validation failures would be easier to see and analyze.
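To illustrate the hashing idea in point 3, here is a minimal sketch of recording a digest of the failed content rather than the content itself (all field names here are hypothetical, not Metron's actual error schema):

```python
import hashlib

def build_error_record(raw_value: bytes, error_type: str) -> dict:
    """Build an error record carrying a SHA-256 of the failed content
    instead of the content itself, so repeats can be correlated over
    time without retaining potentially attacker-controlled data."""
    return {
        "error_type": error_type,  # illustrative field name
        "raw_value_sha256": hashlib.sha256(raw_value).hexdigest(),
    }

# Two failures on the same payload yield the same hash, so repeats
# are easy to spot in the index even when they are far apart in time.
a = build_error_record(b"bad-payload", "validation_failure")
b = build_error_record(b"bad-payload", "validation_failure")
print(a["raw_value_sha256"] == b["raw_value_sha256"])
```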
I am seeking feedback on the following:

- How granular would we want this feature to be? Do we think we would want one index/dashboard per source? Or would it be better to collapse everything into the same index?
- Do we care about storing these errors in HDFS as well? Or is indexing them enough?
- What types of errors should we record? I am proposing:

For error reporting:
--Message failed to parse
--Enrichment failed to enrich
--Threat intel feed failures
--Generic catch-all for all other errors

For validation reporting:
--What part of the message failed validation
--What Stellar validator caused the failure

-------------------
Thank you,
James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org

--
Jon

Sent from my mobile device
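One way to read the proposed error taxonomy is as a set of known categories with a generic catch-all, sketched below (category names are illustrative assumptions, not Metron's actual identifiers). Jon's point 3 would amount to promoting indexing errors to a first-class category so they remain searchable rather than falling into the catch-all.

```python
# Hypothetical categories mirroring the proposed error-reporting list.
KNOWN_ERROR_TYPES = {
    "parse_failure",         # message failed to parse
    "enrichment_failure",    # enrichment failed to enrich
    "threat_intel_failure",  # threat intel feed failure
}

def categorize(error_type: str) -> str:
    """Map an error onto a known category, falling back to a generic
    catch-all so every error is still indexed and visible in Kibana."""
    if error_type in KNOWN_ERROR_TYPES:
        return error_type
    return "generic_error"

print(categorize("parse_failure"))     # -> parse_failure
print(categorize("indexing_failure"))  # -> generic_error (catch-all)
```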