Re: [DISCUSS] System time vs. Event Time

2017-03-08 Thread zeo...@gmail.com
I am a huge fan of these ideas and your comments, Matt.  These use cases
have been in the back of my head for a while, so I'm happy to see them
getting discussed.  It would be a huge step forward for Metron capabilities.

I see connections between this discussion and both METRON-192 and
METRON-477, as well as a potential for (c) to be improved in the longer
term (i.e. not "minimum amount of work") to be read in by
sensors/forensics tools and not just get re-parsed.  Maybe even during
data expiration/transformation (as described in METRON-477).

This is definitely not trivial.  That said, I expect that my team will be
working on METRON-477 in the near-medium term future and I want to make
sure that the effort that we undertake aligns with the outcome of this
discussion.

Jon


Re: [DISCUSS] System time vs. Event Time

2017-03-02 Thread Matt Foley
Before the thought becomes obsolete, I’d like to say that I agree with Nick 
about the replay scenario and threat signature databases.  I think a principal 
use case is replaying old data with new threat signatures, to detect problems 
that were undetectable at the time they happened.  The use case Casey brought 
up, where you want to reproduce the exact behavior of an earlier PiT of your 
system, including using the threat signature database versions that were 
installed at that time, would also be useful for debugging, system 
understanding, and testing, but I think it is lower priority than the former.

Another high priority use case is replaying data with new Profiler 
configurations, to answer questions that we hadn’t thought about asking before.

So, Justin, I think the minimum amount of work for a useful batch process is
to:
(a) Make sure event time rather than system time is usable, if not the default, 
in all components that record, manipulate, or select based on timestamps.
(b) Enable a chunk of data, defined by our shiny new time window DSL, to be 
output in chron order from sources that store whole messages (HDFS, PCAP, maybe 
Solr/ES, maybe raw data files with a time window filter), and routed into a 
kafka topic, with throttling so kafka doesn’t try to swallow several TB at once.
(c) Which can then be read by a Parser, and the result piped through the whole 
system, all the way to threat detection, profiling, and filtered re-recording.
(d) The result set (in HDFS, ES, or Profiler) needs to remain “tagged” somehow 
with a batch identifier, both so it doesn’t get mixed up with all the other 
data from that event time, and so it can be bulk-deleted if you made a mistake 
and asked for TB’s of the wrong data.

An interesting part of (c) is that we don’t really want the “batch” to 
interfere with on-going real-time processing.  Ideally the mechanism would also 
deal with data analysts submitting multiple batch requests at the same time 
(altho admittedly that could be handled with a queue).

Is it sufficient to simply depend on the event time stamp to route stuff 
appropriately?  That doesn’t seem to meet (d).  We could effectively 
“virtualize” the batch job by suffixing the kafka topic names for the whole 
data flow related to a batch.  Batch id “foley3256”, being a bunch of bro 
messages, could enter the Bro Parser on topic bro_foley3256.  To carry this 
through to enrichment, etc., maybe it is sufficient to record the sensorType as 
“bro_foley3256”, or maybe it should be sensorType “bro” on kafka topic 
“enrichment_foley3256”.  Such schemes could satisfy (d) above, also.  Obviously 
there’s a lot of possible variations on this theme.  What do you think?

--Matt


Re: [DISCUSS] System time vs. Event Time

2017-03-02 Thread Justin Leet
I'm just going to throw out a few questions that I don't have good
answers to.  Casey and Nick, given your familiarity with the systems
involved, do you have any thoughts?

   - What's the smallest unit of work we can do to enable at least a useful
   subset of a fully featured batch process? Looking at it from another
   angle, which of the use cases (either that Nick listed, or that anyone else
   has) gives us the best value for our effort?
   - Can we also do things like limiting support for the interdependencies
   Casey mentioned? If we do approach it that way, how do we avoid setting
   ourselves up for issues parallelizing the more complicated cases?  It
   sounds like we'll need to brainstorm some of the dependency stuff anyway.
   - Are there places right now (like the elasticsearch jira) where we need
   or want to make changes to either fix, or improve, or enable some of the
   larger-picture work?

Jon, any other thoughts?  Sounded like you were waiting to see how things
played out a bit, so if you have any insight, I'd love to hear it.

Justin


Re: [DISCUSS] System time vs. Event Time

2017-02-28 Thread Justin Leet
@Jon, it looks like it is based on system date.

From ElasticsearchWriter.write:
String indexPostfix = dateFormat.format(new Date());
...
indexName = indexName + "_index_" + indexPostfix;
...
IndexRequestBuilder indexRequestBuilder = client.prepareIndex(indexName,
sensorType + "_doc");
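For contrast, a minimal sketch (not actual Metron code) of an event-time variant, where the index postfix comes from the message's timestamp rather than `new Date()`; the `yyyy.MM.dd.HH` pattern and UTC zone are assumptions for illustration:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class EventTimeIndexName {
    // Hour-grained postfix; the exact pattern is an assumption, not Metron's config.
    private static final SimpleDateFormat FORMAT = new SimpleDateFormat("yyyy.MM.dd.HH");
    static { FORMAT.setTimeZone(TimeZone.getTimeZone("UTC")); }

    /**
     * Derive the index postfix from the event timestamp carried in the message,
     * falling back to system time only when the message has no timestamp.
     */
    public static String indexName(String baseName, Long eventTimestampMs) {
        long ts = (eventTimestampMs != null) ? eventTimestampMs : System.currentTimeMillis();
        return baseName + "_index_" + FORMAT.format(new Date(ts));
    }
}
```

With this, replayed messages land in the historical index matching their event time instead of today's index.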

Justin



Re: [DISCUSS] System time vs. Event Time

2017-02-28 Thread zeo...@gmail.com
I'm actually a bit surprised to see METRON-691, because I know a while back
I did some experiments to ensure that data was being written to the indexes
that relate to the timestamp in the message, not the current time, and I
thought that messages were getting written to the proper historical
indexes, not the current one.  This was so long ago now, though, that it
would require another look, and I only reviewed it operationally (put
message on topic with certain timestamp, search for it in kibana).

If that is not the case currently (which I should be able to verify later
this week) then that would be pretty concerning and somewhat separate from
the previous "Metron Batch" style discussions, which are more focused on
data bulk load or historical analysis.

I will wait to see how the rest of this conversation pans out before giving
my thoughts on the bigger picture.

Jon



Re: [DISCUSS] System time vs. Event Time

2017-02-28 Thread Nick Allen
Let's make sure we have a common understanding of the use case (there are
likely many).  What you mentioned was replaying historical data, which is
very cool, but can mean a lot of different things (being that we all have
very active imaginations).

Here are a few broad strokes of what I have been thinking about that
relates to "replay".  Hopefully, this relates to your thoughts in this
thread and I am not taking us on a tangent.

(1) As a Security Data Scientist, I'd like to be able to replay historical
pcap through my signature-based, IDS suite.  Between the time when the pcap
was captured and now (2, 8, 12 weeks), my signatures have been updated
based on newly discovered threats in the wild.  If I find newly generated
alerts during replay that were not generated initially, then my systems
were likely breached by (or at least exposed to) an advanced actor with
access to a zero day vulnerability.  Since exploitation can often take
months, I still have time to react and mitigate the breach.

(2) As a Security Data Scientist, I don't want to wait for a profile to be
generated from data in real time.  It is difficult to understand whether
the profile I have created is (a) correct or (b) has any value to me unless
I can see data from it over a span of time.  If I have to wait for the
profile to be generated in real-time this slows down my progress in
performing exploratory analysis and model building.


(3) As an Investigator, I need to create a profile to investigate ongoing
suspicious activity.  I often investigate incidents that began in the past
and may or may not currently be active.  I often don't know what I need to
profile until responding to an active incident.  If I could generate a
profile from a starting point in the past, I might be able to understand
how a security incident began, how it has spread, and what assets have been
exposed.

(4) As a Platform Engineer, I was given a model to deploy in production.
The model needs data from a profile generated by the Profiler.  I'd like
instant feedback to know whether I deployed things correctly.  If I could
generate a profile from some point in the past, I could validate that the
model and profile work on production data sooner.  The model would also
start functioning sooner.


Re: [DISCUSS] System time vs. Event Time

2017-02-28 Thread Casey Stella
I think this is a really tricky topic, but necessary.  I've given it a bit
of thought over the last few months and I don't really see a great way to
do it given the Profiler.  Here's what I've come up with so far, though, in
my thinking.


   - Replaying events will compress events in time (e.g. 2 years of data
   may come through in 10 minutes)
   - Replaying events may result in events being out of order temporally
   even if it is written to kafka in order (just by virtue of hitting a
   different kafka partition)

Given both of these, in my mind we should handle replaying of data *not*
within a streaming context so we can control the order and the grouping of
the data.  In my mind, this is essentially the advent of batch Metron.  Off
the top of my head, I'm having trouble thinking about how to parallelize
this, however, in a pretty manner.

Imagine a scenario where telemetry A has an enrichment E1 that depends on
profile P1 and profile P1 depends on the previous 10 minutes of data.  How
in a batch or streaming context can we ever hope to ensure that the
profiles for P1 for the last 10 minutes are in place as data flows through
across all data points? Now how about if the values that P1 depend on are
computed from a profile P2?  Essentially you have a data dependency graph
between enrichments and profiles and raw data that you need to work in
order.





[DISCUSS] System time vs. Event Time

2017-02-28 Thread Justin Leet
There's a couple JIRAs related to the use of system time vs event time.

METRON-590 Enable Use of Event Time in Profiler

METRON-691 Elastic Writer index partitions on system time, not event time


Is there anything else that needs to be making this distinction, and if so,
do we need to be able to support both system time and event time for it?

My immediate thought on this is that, once we work on replaying historical
data, we'll want system time for geo data passing through.  Given that the
geo files can update, we'd want to know which geo file we actually need to
be using at the appropriate time.
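One hedged sketch of what "knowing which geo file to use" could mean in code: track each geo file version by the time it became effective, and pick the latest version at or before the event's timestamp (the class and file names are hypothetical, not part of Metron):

```java
import java.util.Map;
import java.util.TreeMap;

public class GeoFileSelector {
    // effective-from timestamp (ms) -> path of the geo database installed at that time
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public void register(long effectiveFromMs, String path) {
        versions.put(effectiveFromMs, path);
    }

    /**
     * The geo file that was current at the event's timestamp: the latest
     * version whose effective-from time is <= the event time.
     * Returns null if no version predates the event.
     */
    public String fileFor(long eventTimestampMs) {
        Map.Entry<Long, String> e = versions.floorEntry(eventTimestampMs);
        return (e == null) ? null : e.getValue();
    }
}
```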

We'll probably also want to double check anything else that writes out data
to a location and provides some sort of timestamping on it.

Justin