RE: separate JIRA for MPack/Ansible. No objection to tracking them separately, but for this item to be complete, you'll need both the feature and the ability to install it.
-D...

On Tue, Jan 24, 2017 at 5:33 PM, Ryan Merriman <merrim...@gmail.com> wrote:

Assuming we're going to write all errors to a single error topic, I think it makes sense to agree on an error message schema and handle errors across the 3 different topologies in the same way, with a single implementation. The implementation in ParserBolt (ErrorUtils.handleError) produces the most verbose error object, so I think it's a good candidate for the single implementation. Here is the message structure it currently produces:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error"
}

From our discussion so far we need to add a couple of fields: an error type and a hash id. Adding these to the message looks like:

{
  "exception": "java.lang.Exception: there was an error",
  "hostname": "host",
  "stack": "java.lang.Exception: ...",
  "time": 1485295416563,
  "message": "there was an error",
  "rawMessage": "raw message",
  "rawMessage_bytes": [],
  "source.type": "bro_error",
  "error.type": "parser_error",
  "rawMessage_hash": "dde41b9920954f94066daf6291fb58a9"
}

We should also consider expanding the error types I listed earlier. Instead of just having "indexing_error" we could have "elasticsearch_indexing_error", "hdfs_indexing_error", and so on.

Jon, if an exception happens in an enrichment or threat intel bolt, the message is passed along with no error thrown (only logged). Everywhere else I'm having trouble identifying specific fields that should be hashed. Would hashing the message in every case be acceptable? Do you know of a place where we could hash a field instead? On the topic of exceptions in enrichments, are we OK with an error only being logged and not added to the message or emitted to the error queue?
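For illustration, here is a minimal sketch (in Python, though Metron's ErrorUtils is Java) of producing the proposed schema with the two new fields. The MD5 choice is an assumption inferred from the 32-hex-character example hash above; the helper and its `hostname` default are hypothetical, not existing Metron code:

```python
import hashlib
import json
import time

def build_error_message(raw_message, error_text, source_type, error_type, hostname="host"):
    """Sketch of the unified error schema proposed above (field names from the thread)."""
    return {
        "exception": "java.lang.Exception: " + error_text,
        "hostname": hostname,                    # in Metron this would be the worker host
        "time": int(time.time() * 1000),         # epoch millis, as in the example
        "message": error_text,
        "rawMessage": raw_message,
        "source.type": source_type,
        "error.type": error_type,                # proposed new field
        # proposed new field: a hash of the raw message (MD5 is an assumption),
        # so repeated failures of the same input can be grouped without
        # storing the input at full fidelity
        "rawMessage_hash": hashlib.md5(raw_message.encode("utf-8")).hexdigest(),
    }

print(json.dumps(build_error_message(
    "raw message", "there was an error", "bro_error", "parser_error"), indent=2))
```

Because every bolt would call the same helper, the three topologies emit structurally identical errors, which is what makes a single index and dashboard workable downstream.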
On Tue, Jan 24, 2017 at 3:10 PM, Ryan Merriman <merrim...@gmail.com> wrote:

That use case makes sense to me. I don't think it will require that much additional effort either.

On Tue, Jan 24, 2017 at 1:02 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

Regarding error vs. validation: either way I'm not very concerned. I initially assumed they would be combined and agree with that approach, but splitting them out isn't a very big deal to me either.

Re: Ryan. Yes, exactly. In the case of a parser issue (or anywhere else where it's not possible to pick out the exact thing causing the issue), it would be a hash of the complete message.

Regarding the architecture, I mostly agree with James, except that I think step 3 also needs to be able to somehow group errors via the original data (identify replays, identify repeat issues with data in a specific field, issues with consistently different data, etc.). This is essentially the first step of troubleshooting, which I assume you are doing if you're looking at the error dashboard.

If the hash gets moved out of the initial implementation, I'm fairly certain you lose this ability. The point here isn't to handle long fields (although that's a benefit of this approach); it's to attach a unique identifier to the error/validation issue message that links it to the original problem. I'd be happy to consider alternative solutions to this problem (for instance, actually sending across the data itself); I just haven't been able to think of another way to do this that I like better.

Jon

On Tue, Jan 24, 2017 at 1:13 PM, Ryan Merriman <merrim...@gmail.com> wrote:

We also need a JIRA for any install/Ansible/MPack work needed.
On Tue, Jan 24, 2017 at 12:06 PM, James Sirota <jsir...@apache.org> wrote:

Now that I've had some time to think about it, I would collapse all error and validation topics into one. We can differentiate between different views of the data (split by error source, etc.) via Kibana dashboards. I would implement this feature incrementally. First, I would modify all the bolts to log to a single topic. Second, I would get the error indexing done by attaching the indexing topology to the error topic. Third, I would create the necessary dashboards to view errors and validation failures by source. Lastly, I would file a follow-on JIRA to introduce hashing of errors or fields that are too long. It seems like a separate feature that we need to think through. We may need a Stellar function around that.

Thanks,
James

24.01.2017, 10:25, "Ryan Merriman" <merrim...@gmail.com>:

I understand what Jon is talking about. He's proposing we hash the value that caused the error, not necessarily the error message itself. For an enrichment this is easy: just pass along the field value that failed enrichment. For other cases the field that caused the error may not be so obvious. Take parser validation, for example. The message is validated as a whole, and it may not be easy to determine which field is the cause. In that case, would a hash of the whole message work?

There is a broader architectural discussion that needs to happen before we can implement this.
Currently we have an indexing topology that reads from one topic and writes messages to ES, but errors are written to several different topics:

- parser_error
- parser_invalid
- enrichments_error
- threatintel_error
- indexing_error

I can see 4 possible approaches to implementing this:

1. Create an index topology for each error topic
   - Good because we can easily reuse the indexing topology, and it would require the least development effort
   - Bad because it would consume a lot of extra worker slots
2. Move the topic name into the error JSON message as a new "error_type" field and write all messages to the indexing topic
   - Good because we don't need to create a new topology
   - Bad because we would be flowing data and errors through the same topology; a spike in errors could affect message indexing
3. A compromise between 1 and 2: create another indexing topology that is dedicated to indexing errors, move the topic name into the error JSON message as a new "error_type" field, and write all errors to a single error topic
4. Write a completely new topology with multiple spouts (one for each error type listed above) that all feed into a single BulkMessageWriterBolt
   - Good because the current topologies would not need to change
   - Bad because it would require the most development effort, would not reuse existing topologies, and takes up more worker slots than 3

Are there other approaches I haven't thought of? I think 1 and 2 are off the table because they are shortcuts and not good long-term solutions.
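Approaches 2 and 3 both hinge on collapsing the five error topics into one and letting the origin travel inside the message. A minimal sketch of that producer-side tagging, assuming the topic and helper names (both hypothetical, not existing Metron code):

```python
# Hypothetical sketch of the "error_type" tagging in approaches 2 and 3:
# the five existing error topics collapse into one shared topic, and the
# origin becomes a queryable field on the message itself.
ERROR_TOPIC = "error"  # assumed name for the single shared error topic

LEGACY_TOPICS = {
    "parser_error",
    "parser_invalid",
    "enrichments_error",
    "threatintel_error",
    "indexing_error",
}

def route_error(error_json, legacy_topic):
    """Return (topic, message) for the single-topic scheme."""
    if legacy_topic not in LEGACY_TOPICS:
        raise ValueError("unknown error source: " + legacy_topic)
    tagged = dict(error_json)                # don't mutate the caller's message
    tagged["error_type"] = legacy_topic      # origin becomes a queryable field
    return ERROR_TOPIC, tagged

topic, msg = route_error({"message": "boom"}, "parser_error")
# every error now lands on one topic; dashboards filter on error_type
```

The payoff is that Kibana views split by error source need no topology changes at all, just a filter on the new field.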
3 would be my choice because it introduces less complexity than 4. Thoughts?

Ryan

On Mon, Jan 23, 2017 at 5:44 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

In that case the hash would be of the value in the IP field, such as sha3(8.8.8.8).

Jon

On Mon, Jan 23, 2017, 6:41 PM, James Sirota <jsir...@apache.org> wrote:

Jon,

I am still not entirely following why we would want to use hashing. For example, if my error is "Your IP field is invalid and failed validation", hashing this error string will always result in the same hash. Why not just use the actual error string? Can you provide an example where you would use it?

Thanks,
James

23.01.2017, 16:29, "zeo...@gmail.com" <zeo...@gmail.com>:

For 1, I'm good with that.

I'm talking about hashing the relevant content itself, not the error. Some benefits are: (1) it minimizes load on the search index (there's minimal benefit in spending the CPU and disk to keep the content at full fidelity, i.e. tokenized and stored); (2) it provides something to key on for dashboards (assuming a good hash algorithm that avoids collisions and is second-preimage resistant); and (3) specific to errors, if the issue is that the content failed to index, a hash gives us some protection that the issue will not occur twice.
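Jon's scheme can be sketched concretely: hash the offending field's value when it can be identified (his sha3(8.8.8.8) example), and fall back to hashing the whole message when it can't, as in a parser failure. SHA3-256 here is an assumption based on his example; the helper is illustrative, not an existing Metron API:

```python
import hashlib
import json

def error_hash(message, failed_field=None):
    """Hash the value that caused the error when it can be identified,
    falling back to the whole message (e.g. for parser failures)."""
    if failed_field is not None and failed_field in message:
        content = str(message[failed_field])      # e.g. "8.8.8.8"
    else:
        # canonicalize so the same message always yields the same hash
        content = json.dumps(message, sort_keys=True)
    return hashlib.sha3_256(content.encode("utf-8")).hexdigest()

# enrichment failure: the offending field is known, so hash just its value
h1 = error_hash({"ip_src_addr": "8.8.8.8", "other": "fields"}, failed_field="ip_src_addr")
# parser failure: no single field to blame, so hash the whole message
h2 = error_hash({"raw": "unparseable line"})
```

Note the canonicalization step: hashing a raw dict-to-string rendering would make the hash depend on field order, which would defeat the grouping-over-time use case Jon describes.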
Jon

On Mon, Jan 23, 2017, 2:47 PM, James Sirota <jsir...@apache.org> wrote:

Jon,

With regard to 1, collapsing to a single dashboard for each would be fine. So we would have one error index and one "failed to validate" index. The distinction is that errors would be things that went wrong during stream processing (failed to parse, etc.), while validation failures are messages that explicitly failed Stellar validation/schema enforcement. There should be relatively few of the second type.

With respect to 3, why do you want the error hashed? Why not just search for the error text?

Thanks,
James

20.01.2017, 14:01, "zeo...@gmail.com" <zeo...@gmail.com>:

As someone who currently fills the platform engineer role, I can give this idea a huge +1. My thoughts:

1. I think it depends on exactly what data is pushed into the index (#3). However, assuming the errors you proposed recording, I can't see huge benefits to having more than one dashboard. I would be happy to be persuaded otherwise.

2. I would say yes, storing the errors in HDFS in addition to indexing is a good thing.
Using METRON-510 <https://issues.apache.org/jira/browse/METRON-510> as a case study, there is the potential in this environment for attacker-controlled data to result in processing errors, which could be a method of evading security monitoring. Once an attack is identified, the long-term HDFS storage would allow better historical analysis of low-and-slow/persistent attacks (I'm thinking of a method of data exfil that also won't successfully get stored in Lucene, but is hard to identify over a short period of time).

- Along this line, I think there are various parts of Metron (this included) which could benefit from a method of configuring data aging by bucket in HDFS (following Nick's comments here <https://issues.apache.org/jira/browse/METRON-477>).

3. I would potentially add a hash of the content that failed validation to help identify repeats over time, with less of a concern that you'd have back-to-back failures (i.e. instead of storing the value itself). Additionally, I think it's helpful to be able to search all the times there was an indexing error (instead of it hitting the catch-all).
Jon

On Fri, Jan 20, 2017 at 1:17 PM, James Sirota <jsir...@apache.org> wrote:

We already have the capability to capture bolt errors and validation errors and pipe them into a Kafka topic. I want to propose that we attach a writer topology to the error and validation-failed Kafka topics so that we can (a) create a new ES index for these errors and (b) create a new Kibana dashboard to visualize them. The benefit would be that errors and validation failures would be easier to see and analyze.

I am seeking feedback on the following:

- How granular would we want this feature to be? Do we think we would want one index/dashboard per source? Or would it be better to collapse everything into the same index?
- Do we care about storing these errors in HDFS as well? Or is indexing them enough?
- What types of errors should we record?
I am proposing:

For error reporting:
--Message failed to parse
--Enrichment failed to enrich
--Threat intel feed failures
--Generic catch-all for all other errors

For validation reporting:
--What part of the message failed validation
--What Stellar validator caused the failure

-------------------
Thank you,

James Sirota
PPMC - Apache Metron (Incubating)
jsirota AT apache DOT org

--
Jon

Sent from my mobile device