Before the thought becomes obsolete, I’d like to say that I agree with Nick 
about the replay scenario and threat signature databases.  I think a principal 
use case is replaying old data with new threat signatures, to detect problems 
that were undetectable at the time they happened.  The use case Casey brought
up, where you want to reproduce the exact behavior of your system at an earlier
point in time (PiT), including using the threat signature database versions that
were installed at that time, would also be useful for debugging, system
understanding, and testing, but I think it is lower priority than the former.

Another high-priority use case is replaying data with new Profiler
configurations, to answer questions we hadn’t thought to ask before.

So, Justin, I think the minimum amount of work for a useful batch process is to:
(a) Make sure event time rather than system time is usable, if not the default, 
in all components that record, manipulate, or select based on timestamps.
(b) Enable a chunk of data, defined by our shiny new time window DSL, to be
output in chronological order from sources that store whole messages (HDFS,
PCAP, maybe Solr/ES, maybe raw data files with a time window filter), and routed
into a Kafka topic, with throttling so Kafka doesn’t try to swallow several TB
at once (see the sketch after this list).
(c) Allow that topic to be read by a Parser, and the result piped through the
whole system, all the way to threat detection, profiling, and filtered
re-recording.
(d) Keep the result set (in HDFS, ES, or Profiler) “tagged” somehow with a batch
identifier, both so it doesn’t get mixed up with all the other data from that
event time, and so it can be bulk-deleted if you made a mistake and asked for
TBs of the wrong data.
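
To make (a) and (b) concrete, here’s a rough sketch of what the replay-into-Kafka
step might look like.  This is purely illustrative Java: the class, the message
shape, the broker address, and the rate are all made up, and nothing like this
exists in Metron today.

    import com.google.common.util.concurrent.RateLimiter;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.json.simple.JSONObject;
    import java.util.Properties;

    public class BatchReplay {
      /**
       * Replay an already-time-ordered window of raw messages (read from HDFS,
       * PCAP, etc. by whatever implements (b)) into a per-batch Kafka topic.
       */
      public static void replay(Iterable<JSONObject> windowInEventTimeOrder,
                                String sensorType, String batchId, double msgsPerSec) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667");   // made-up broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          String topic = sensorType + "_" + batchId;            // e.g. "bro_foley3256", ties into (d)
          RateLimiter limiter = RateLimiter.create(msgsPerSec); // throttle so Kafka isn't handed TBs at once
          for (JSONObject message : windowInEventTimeOrder) {
            limiter.acquire();
            long eventTime = ((Number) message.get("timestamp")).longValue();
            // Third argument is the Kafka record timestamp: use the message's event
            // time, not System.currentTimeMillis(), so downstream sees event time, per (a).
            producer.send(new ProducerRecord<>(topic, null, eventTime,
                                               null, message.toJSONString()));
          }
        }
      }
    }

The only points that matter here are that the record timestamp carries event time
and the topic name carries the batch id; the real thing would obviously need
proper configuration, error handling, and a pluggable source reader.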

An interesting part of (c) is that we don’t really want the “batch” to
interfere with ongoing real-time processing.  Ideally the mechanism would also
deal with data analysts submitting multiple batch requests at the same time
(although admittedly that could be handled with a queue).

Is it sufficient to simply depend on the event timestamp to route data
appropriately?  That doesn’t seem to meet (d).  We could effectively
“virtualize” the batch job by suffixing the Kafka topic names for the whole
data flow related to a batch.  Batch id “foley3256”, consisting of a bunch of
Bro messages, could enter the Bro Parser on topic bro_foley3256.  To carry this
through to enrichment, etc., maybe it is sufficient to record the sensorType as
“bro_foley3256”, or maybe it should be sensorType “bro” on Kafka topic
“enrichment_foley3256”.  Such schemes could also satisfy (d) above.  Obviously
there are a lot of possible variations on this theme.  What do you think?
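
For illustration only, the naming side of that scheme could be as trivial as the
following (hypothetical helpers and field name, not an existing Metron API):

    import org.json.simple.JSONObject;

    /** Hypothetical helpers for the topic-suffix / batch-id scheme. */
    public class BatchNaming {
      /** topicFor("bro", "foley3256") -> "bro_foley3256"; a null batchId means the normal real-time flow. */
      public static String topicFor(String baseTopic, String batchId) {
        return batchId == null ? baseTopic : baseTopic + "_" + batchId;
      }

      /** Tag each message so indexed/profiled/re-recorded results stay identifiable and bulk-deletable, per (d). */
      @SuppressWarnings("unchecked")
      public static JSONObject tag(JSONObject message, String batchId) {
        message.put("batch.id", batchId);   // hypothetical field name
        return message;
      }
    }

Tagging each message with a batch id field, in addition to (or instead of) the
topic suffix, is what would make the bulk-delete in (d) straightforward.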

--Matt

On 3/2/17, 12:54 PM, "Justin Leet" <justinjl...@gmail.com> wrote:

    I'm just going to throw out a few questions that I don't have good
    answers to.  Casey and Nick, given your familiarity with the systems
    involved, do you have any thoughts?
    
       - What's the smallest unit of work we can do to enable at least a useful
       subset of a fully featured batch process? Looking at it from another
       angle, which of the use cases (either that Nick listed, or that anyone else
       has) gives us the best value for our effort?
       - Can we also do things like limiting support for the interdependencies
       Casey mentioned? If we do approach it that way, how do we avoid setting
       ourselves up for issues parallelizing the more complicated cases?  It
       sounds like we'll need to brainstorm some of the dependency stuff anyway.
       - Are there places right now (like the Elasticsearch JIRA) where we need
       or want to make changes to either fix, improve, or enable some of the
       larger-picture work?
    
    Jon, any other thoughts?  Sounded like you were waiting to see how things
    played out a bit, so if you have any insight, I'd love to hear it.
    
    Justin
    
    On Tue, Feb 28, 2017 at 11:08 AM, Justin Leet <justinjl...@gmail.com> wrote:
    
    > @Jon, it looks like it is based on system date.
    >
    > From ElasticsearchWriter.write:
    > String indexPostfix = dateFormat.format(new Date());
    > ...
    > indexName = indexName + "_index_" + indexPostfix;
    > ...
    > IndexRequestBuilder indexRequestBuilder = client.prepareIndex(indexName,
    > sensorType + "_doc");
    >
    > Justin
    >
    > On Tue, Feb 28, 2017 at 10:44 AM, zeo...@gmail.com <zeo...@gmail.com>
    > wrote:
    >
    >> I'm actually a bit surprised to see METRON-691, because I know a while
    >> back
    >> I did some experiments to ensure that data was being written to the
    >> indexes
    >> that relate to the timestamp in the message, not the current time, and I
    >> thought that messages were getting written to the proper historical
    >> indexes, not the current one.  This was so long ago now, though, that it
    >> would require another look, and I only reviewed it operationally (put
    >> message on topic with certain timestamp, search for it in kibana).
    >>
    >> If that is not the case currently (which I should be able to verify later
    >> this week) then that would be pretty concerning and somewhat separate from
    >> the previous "Metron Batch" style discussions, which are more focused on
    >> data bulk load or historical analysis.
    >>
    >> I will wait to see how the rest of this conversation pans out before
    >> giving
    >> my thoughts on the bigger picture.
    >>
    >> Jon
    >>
    >> On Tue, Feb 28, 2017 at 9:19 AM Casey Stella <ceste...@gmail.com> wrote:
    >>
    >> > I think this is a really tricky topic, but necessary.  I've given it a bit
    >> > of thought over the last few months and I don't really see a great way to
    >> > do it given the Profiler.  Here's what I've come up with so far, though, in
    >> > my thinking.
    >> >
    >> >
    >> >    - Replaying events will compress events in time (e.g. 2 years of data
    >> >    may come through in 10 minutes)
    >> >    - Replaying events may result in events being out of order temporally
    >> >    even if they are written to kafka in order (just by virtue of hitting a
    >> >    different kafka partition)
    >> >
    >> > Given both of these, in my mind we should handle replaying of data *not*
    >> > within a streaming context so we can control the order and the grouping of
    >> > the data.  In my mind, this is essentially the advent of batch Metron.  Off
    >> > the top of my head, I'm having trouble thinking about how to parallelize
    >> > this, however, in a pretty manner.
    >> >
    >> > Imagine a scenario where telemetry A has an enrichment E1 that depends on
    >> > profile P1 and profile P1 depends on the previous 10 minutes of data.  How
    >> > in a batch or streaming context can we ever hope to ensure that the
    >> > profiles for P1 for the last 10 minutes are in place as data flows through
    >> > across all data points? Now how about if the values that P1 depends on are
    >> > computed from a profile P2?  Essentially you have a data dependency graph
    >> > between enrichments and profiles and raw data that you need to work in
    >> > order.
    >> >
    >> >
    >> >
    >> > On Tue, Feb 28, 2017 at 8:03 AM, Justin Leet <justinjl...@gmail.com>
    >> > wrote:
    >> >
    >> > > There are a couple of JIRAs related to the use of system time vs event
    >> > > time.
    >> > >
    >> > > METRON-590 Enable Use of Event Time in Profiler
    >> > > <https://issues.apache.org/jira/browse/METRON-590>
    >> > > METRON-691 Elastic Writer index partitions on system time, not event time
    >> > > <https://issues.apache.org/jira/browse/METRON-691>
    >> > >
    >> > > Is there anything else that needs to be making this distinction, and if
    >> > > so, do we need to be able to support both system time and event time for
    >> > > it?
    >> > >
    >> > > My immediate thought on this is that, once we work on replaying historical
    >> > > data, we'll want system time for geo data passing through.  Given that the
    >> > > geo files can update, we'd want to know which geo file we actually need to
    >> > > be using at the appropriate time.
    >> > >
    >> > > We'll probably also want to double check anything else that writes out
    >> > > data to a location and provides some sort of timestamping on it.
    >> > >
    >> > > Justin
    >> > >
    >> >
    >> --
    >>
    >> Jon
    >>
    >> Sent from my mobile device
    >>
    >
    >
    

