As someone who currently fills the platform engineer role, I can give this
idea a huge +1.  My thoughts:

1.  I think it depends on exactly what data is pushed into the index (#3).
However, assuming the error types you proposed recording, I can't see a huge
benefit to having more than one dashboard.  I would be happy to be
persuaded otherwise.

2.  I would say yes, storing the errors in HDFS in addition to indexing them
is a good thing.  Using METRON-510
<https://issues.apache.org/jira/browse/METRON-510> as a case study, there
is the potential in this environment for attacker-controlled data to trigger
processing errors, which could become a method of evading security
monitoring.  Once an attack is identified, the long-term HDFS storage would
allow better historical analysis of low-and-slow/persistent attacks (I'm
thinking of a method of data exfil that also won't successfully get stored
in Lucene, but is hard to identify over a short period of time).
 - Along this line, I think there are various parts of Metron (this
included) which could benefit from having a method of configuring data aging
by bucket in HDFS (following Nick's comments here
<https://issues.apache.org/jira/browse/METRON-477>).

3.  I would potentially add a hash of the content that failed validation
(i.e. instead of storing the value itself), to help identify repeats over
time without having to rely on back-to-back failures.  Additionally, I think
it's helpful to be able to search for every time there was an indexing error
(instead of it hitting the catch-all).
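To make the hashing idea concrete, here is a minimal sketch (the class and
method names are hypothetical, not existing Metron code): hashing the failed
value gives a stable identifier for spotting repeats over time without
persisting the raw, possibly attacker-controlled, content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ErrorHash {
    // Hash the raw value that failed validation so repeated failures can be
    // correlated over time without storing the value itself.
    public static String sha256Hex(String rawValue) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(rawValue.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every JVM, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // The same bad value always yields the same hash, so repeats are
        // easy to aggregate on in the index.
        System.out.println(
            ErrorHash.sha256Hex("bad-field-value")
                     .equals(ErrorHash.sha256Hex("bad-field-value"))); // prints true
    }
}
```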

Jon

On Fri, Jan 20, 2017 at 1:17 PM James Sirota <jsir...@apache.org> wrote:

We already have the capability to capture bolt errors and validation errors
and pipe them into a Kafka topic.  I want to propose that we attach a
writer topology to the error and validation-failed Kafka topics so that we
can (a) create a new ES index for these errors and (b) create a new Kibana
dashboard to visualize them.  The benefit would be that errors and
validation failures would be easier to see and analyze.
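As a rough sketch of what such a writer topology might hand to the ES writer
(the builder and the field names here are hypothetical, not existing Metron
code), the flat document built from a message pulled off the error topic
could look like this:

```java
import java.util.HashMap;
import java.util.Map;

public class ErrorDocBuilder {
    // Hypothetical transform: given fields pulled off the error Kafka topic,
    // build the flat document that would be indexed into ES for the
    // error dashboard.
    public static Map<String, Object> buildErrorDoc(String sourceType,
                                                    String errorType,
                                                    String rawMessage,
                                                    long timestampMillis) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("source.type", sourceType);   // e.g. which sensor produced the message
        doc.put("error.type", errorType);     // e.g. parser_error, enrichment_error
        doc.put("raw_message", rawMessage);   // the payload that failed processing
        doc.put("timestamp", timestampMillis);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc =
            buildErrorDoc("bro", "parser_error", "{ malformed }", 1484936220000L);
        System.out.println(doc.get("error.type")); // prints parser_error
    }
}
```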

I am seeking feedback on the following:

- How granular would we want this feature to be?  Do we think we would want
one index/dashboard per source?  Or would it be better to collapse
everything into the same index?
- Do we care about storing these errors in HDFS as well?  Or is indexing
them enough?
- What types of errors should we record?  I am proposing:

For error reporting:
--Message failed to parse
--Enrichment failed to enrich
--Threat intel feed failures
--Generic catch-all for all other errors

For validation reporting:
--What part of the message failed validation
--Which Stellar validator caused the failure



-------------------
Thank you,

James Sirota
PPMC- Apache Metron (Incubating)
jsirota AT apache DOT org
