I am a huge fan of these ideas and your comments, Matt. These use cases have been in the back of my head for a while, so I'm happy to see them getting discussed. It would be a huge step forward for Metron's capabilities.
I see connections between this discussion and both METRON-192 <https://issues.apache.org/jira/browse/METRON-192> and METRON-477 <https://issues.apache.org/jira/browse/METRON-477>, as well as potential for (c) to be improved in the longer term (i.e., beyond the "minimum amount of work") so that its output can be read by sensors/forensics tools rather than just re-parsed. Maybe even during data expiration/transformation (as described in METRON-477). This is definitely not trivial. That said, I expect that my team will be working on METRON-477 in the near-to-medium term, and I want to make sure that the effort we undertake aligns with the outcome of this discussion.

Jon

On Thu, Mar 2, 2017 at 5:24 PM Matt Foley <ma...@apache.org> wrote:

> Before the thought becomes obsolete, I'd like to say that I agree with Nick about the replay scenario and threat signature databases. I think a principal use case is replaying old data with new threat signatures, to detect problems that were undetectable at the time they happened. The use case Casey brought up, where you want to reproduce the exact behavior of your system at an earlier point in time (PiT), including using the threat signature database versions that were installed at that time, would also be useful for debugging, system understanding, and testing, but I think it is lower priority than the former.
>
> Another high-priority use case is replaying data with new Profiler configurations, to answer questions that we hadn't thought about asking before.
>
> So, Justin, I think the minimum amount of work for a useful batch process is to:
>
> (a) Make sure event time rather than system time is usable, if not the default, in all components that record, manipulate, or select based on timestamps.
> (b) Enable a chunk of data, defined by our shiny new time window DSL, to be output in chronological order from sources that store whole messages (HDFS, PCAP, maybe Solr/ES, maybe raw data files with a time window filter), and routed into a Kafka topic, with throttling so Kafka doesn't try to swallow several TB at once.
> (c) Which can then be read by a Parser, and the result piped through the whole system, all the way to threat detection, profiling, and filtered re-recording.
> (d) The result set (in HDFS, ES, or Profiler) needs to remain "tagged" somehow with a batch identifier, both so it doesn't get mixed up with all the other data from that event time, and so it can be bulk-deleted if you made a mistake and asked for TBs of the wrong data.
>
> An interesting part of (c) is that we don't really want the "batch" to interfere with ongoing real-time processing. Ideally the mechanism would also deal with data analysts submitting multiple batch requests at the same time (although admittedly that could be handled with a queue).
>
> Is it sufficient to simply depend on the event timestamp to route stuff appropriately? That doesn't seem to meet (d). We could effectively "virtualize" the batch job by suffixing the Kafka topic names for the whole data flow related to a batch. Batch id "foley3256", being a bunch of bro messages, could enter the Bro Parser on topic bro_foley3256. To carry this through to enrichment, etc., maybe it is sufficient to record the sensorType as "bro_foley3256", or maybe it should be sensorType "bro" on Kafka topic "enrichment_foley3256". Such schemes could satisfy (d) above, also. Obviously there are a lot of possible variations on this theme. What do you think?
>
> --Matt
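To make the topic-suffixing and the batch tagging in (d) concrete, here's a rough sketch of what a replay producer could look like. To be clear, this is just an illustration under assumptions: the batchTopic() helper, the "batch.id" field, and the broker address are made up for the example and aren't existing Metron APIs.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BatchReplayProducer {

      // Hypothetical: derive a per-batch topic, e.g. "bro_foley3256", so a
      // replay never mixes with the live "bro" topic (the "virtualized" batch).
      static String batchTopic(String sensorType, String batchId) {
        return sensorType + "_" + batchId;
      }

      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        String batchId = "foley3256"; // illustrative batch identifier
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Tag every replayed message with the batch id so downstream results
          // can be filtered or bulk-deleted, per (d). A real implementation
          // would also throttle here so Kafka isn't fed several TB at once.
          String message = "{\"source.type\":\"bro\",\"batch.id\":\"" + batchId + "\"}";
          producer.send(new ProducerRecord<>(batchTopic("bro", batchId), message));
        }
      }
    }

As Matt notes, the suffixed topic doubles as the batch identifier, so (d) largely falls out of the routing for free.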
> On 3/2/17, 12:54 PM, "Justin Leet" <justinjl...@gmail.com> wrote:
>
> I'm just going to throw out a few questions that I don't have good answers to. Casey and Nick, given your familiarity with the systems involved, do you have any thoughts?
>
> - What's the smallest unit of work we can do to enable at least a useful subset of a fully featured batch process? Looking at it from another angle, which of the use cases (either that Nick listed, or that anyone else has) gives us the best value for our effort?
> - Can we also do things like limiting support for the interdependencies Casey mentioned? If we do approach it that way, how do we avoid setting ourselves up for issues parallelizing the more complicated cases? It sounds like we'll need to brainstorm some of the dependency stuff anyway.
> - Are there places right now (like the Elasticsearch JIRA) where we need or want to make changes to either fix, or improve, or enable some of the larger-picture work?
>
> Jon, any other thoughts? It sounded like you were waiting to see how things played out a bit, so if you have any insight, I'd love to hear it.
>
> Justin
>
> On Tue, Feb 28, 2017 at 11:08 AM, Justin Leet <justinjl...@gmail.com> wrote:
>
> > @Jon, it looks like it is based on system date.
> >
> > From ElasticsearchWriter.write:
> >
> >     String indexPostfix = dateFormat.format(new Date());
> >     ...
> >     indexName = indexName + "_index_" + indexPostfix;
> >     ...
> >     IndexRequestBuilder indexRequestBuilder = client.prepareIndex(indexName, sensorType + "_doc");
> >
> > Justin
> >
> > On Tue, Feb 28, 2017 at 10:44 AM, zeo...@gmail.com <zeo...@gmail.com> wrote:
> >
> >> I'm actually a bit surprised to see METRON-691, because I know a while back I did some experiments to ensure that data was being written to the indexes that relate to the timestamp in the message, not the current time, and I thought that messages were getting written to the proper historical indexes, not the current one. This was so long ago now, though, that it would require another look, and I only reviewed it operationally (put a message on a topic with a certain timestamp, search for it in Kibana).
> >>
> >> If that is not the case currently (which I should be able to verify later this week) then that would be pretty concerning and somewhat separate from the previous "Metron Batch" style discussions, which are more focused on bulk data load or historical analysis.
> >>
> >> I will wait to see how the rest of this conversation pans out before giving my thoughts on the bigger picture.
> >>
> >> Jon
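Since METRON-691 keeps coming up: here is a minimal sketch of the difference between the system-time indexing in the ElasticsearchWriter snippet quoted above and a hypothetical event-time variant. The index date pattern and the eventTimeIndex() helper are assumptions for illustration, not the actual writer code.

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class IndexNaming {

      // Assumed hourly index pattern; the real pattern is configurable in Metron.
      private static final SimpleDateFormat DATE_FORMAT =
          new SimpleDateFormat("yyyy.MM.dd.HH");

      // System-time version, as in the ElasticsearchWriter snippet above:
      // replayed historical data would land in *today's* index.
      static String systemTimeIndex(String indexName) {
        return indexName + "_index_" + DATE_FORMAT.format(new Date());
      }

      // Hypothetical event-time version: derive the postfix from the message's
      // own timestamp (epoch millis) so data lands in the historical index.
      static String eventTimeIndex(String indexName, long eventTimeMillis) {
        return indexName + "_index_" + DATE_FORMAT.format(new Date(eventTimeMillis));
      }

      public static void main(String[] args) {
        long twoYearsAgo = System.currentTimeMillis() - 2L * 365 * 24 * 60 * 60 * 1000;
        System.out.println(systemTimeIndex("bro"));             // e.g. bro_index_2017.03.02.17
        System.out.println(eventTimeIndex("bro", twoYearsAgo)); // e.g. bro_index_2015.03.02.17
      }
    }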
> >> On Tue, Feb 28, 2017 at 9:19 AM Casey Stella <ceste...@gmail.com> wrote:
> >>
> >> > I think this is a really tricky topic, but necessary. I've given it a bit of thought over the last few months and I don't really see a great way to do it, given the Profiler. Here's what I've come up with so far, though, in my thinking:
> >> >
> >> > - Replaying events will compress events in time (e.g. 2 years of data may come through in 10 minutes).
> >> > - Replaying events may result in events being out of order temporally even if they are written to Kafka in order (just by virtue of hitting a different Kafka partition).
> >> >
> >> > Given both of these, in my mind we should handle replaying of data *not* within a streaming context, so we can control the order and the grouping of the data. In my mind, this is essentially the advent of batch Metron. Off the top of my head, I'm having trouble thinking about how to parallelize this, however, in a pretty manner.
> >> >
> >> > Imagine a scenario where telemetry A has an enrichment E1 that depends on profile P1, and profile P1 depends on the previous 10 minutes of data. How in a batch or streaming context can we ever hope to ensure that the profiles for P1 for the last 10 minutes are in place as data flows through across all data points? Now how about if the values that P1 depends on are computed from a profile P2? Essentially you have a data dependency graph between enrichments, profiles, and raw data that you need to work in order.
> >> >
> >> > On Tue, Feb 28, 2017 at 8:03 AM, Justin Leet <justinjl...@gmail.com> wrote:
> >> >
> >> > > There are a couple of JIRAs related to the use of system time vs. event time:
> >> > >
> >> > > METRON-590 Enable Use of Event Time in Profiler <https://issues.apache.org/jira/browse/METRON-590>
> >> > > METRON-691 Elastic Writer index partitions on system time, not event time <https://issues.apache.org/jira/browse/METRON-691>
> >> > >
> >> > > Is there anything else that needs to be making this distinction, and if so, do we need to be able to support both system time and event time for it?
> >> > >
> >> > > My immediate thought on this is that, once we work on replaying historical data, we'll want system time for geo data passing through. Given that the geo files can update, we'd want to know which geo file we actually need to be using at the appropriate time.
> >> > >
> >> > > We'll probably also want to double-check anything else that writes out data to a location and provides some sort of timestamping on it.
> >> > >
> >> > > Justin
> >>
> >> --
> >> Jon
> >> Sent from my mobile device
--
Jon
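P.S. Casey's E1 -> P1 -> P2 example is essentially a topological-ordering problem over the enrichment/profile dependency graph. As a hedged sketch of how a batch driver might order that work per replay window (the graph model and names here are illustrative, not an existing Metron structure):

    import java.util.*;

    public class DependencyOrder {

      // Kahn's algorithm over "node -> list of nodes it depends on".
      static List<String> topologicalOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> unmet = new HashMap<>();           // unmet dependency counts
        Map<String, List<String>> dependents = new HashMap<>(); // reverse edges
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
          unmet.put(e.getKey(), e.getValue().size());
          for (String dep : e.getValue()) {
            unmet.putIfAbsent(dep, 0);
            dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
          }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : unmet.entrySet()) {
          if (e.getValue() == 0) { ready.add(e.getKey()); }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
          String node = ready.poll();
          order.add(node);
          for (String d : dependents.getOrDefault(node, Collections.emptyList())) {
            if (unmet.merge(d, -1, Integer::sum) == 0) { ready.add(d); }
          }
        }
        if (order.size() != unmet.size()) {
          throw new IllegalStateException("cycle in profile/enrichment dependencies");
        }
        return order;
      }

      public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("E1", Arrays.asList("P1"));      // enrichment E1 needs profile P1
        deps.put("P1", Arrays.asList("P2"));      // profile P1 needs profile P2
        deps.put("P2", Collections.emptyList());  // P2 needs only raw data
        System.out.println(topologicalOrder(deps)); // [P2, P1, E1]
      }
    }

For each replay time window, profiles would be materialized in this order before the enrichments that depend on them run; a cycle would indicate a misconfiguration.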