I am a huge fan of these ideas and your comments, Matt. These use cases have been in the back of my head for a while, so I'm happy to see them getting discussed. It would be a huge step forward for Metron's capabilities.
I see connections between this discussion and both METRON-192 <https://issues.apache.org/jira/browse/METRON-192> and METRON-477 <https://issues.apache.org/jira/browse/METRON-477>, as well as potential for (c) to be improved in the longer term (i.e., beyond the "minimum amount of work") so that its output can be read by sensors/forensics tools rather than just re-parsed. Maybe even during data expiration/transformation (as described in METRON-477). This is definitely not trivial. That said, I expect that my team will be working on METRON-477 in the near-to-medium term, and I want to make sure that the effort we undertake aligns with the outcome of this discussion.

Jon

On Thu, Mar 2, 2017 at 5:24 PM Matt Foley <ma...@apache.org> wrote:

> Before the thought becomes obsolete, I'd like to say that I agree with Nick about the replay scenario and threat signature databases. I think a principal use case is replaying old data with new threat signatures, to detect problems that were undetectable at the time they happened. The use case Casey brought up, where you want to reproduce the exact behavior of your system at an earlier point in time (PiT), including using the threat signature database versions that were installed at that time, would also be useful for debugging, system understanding, and testing, but I think it is lower priority than the former.
>
> Another high-priority use case is replaying data with new Profiler configurations, to answer questions that we hadn't thought about asking before.
>
> So, Justin, I think the minimum amount of work for a useful batch process is to:
>
> (a) Make sure event time rather than system time is usable, if not the default, in all components that record, manipulate, or select based on timestamps.
> (b) Enable a chunk of data, defined by our shiny new time window DSL, to be output in chronological order from sources that store whole messages (HDFS, PCAP, maybe Solr/ES, maybe raw data files with a time window filter), and routed into a Kafka topic, with throttling so Kafka doesn't try to swallow several TB at once.
> (c) Which can then be read by a Parser, and the result piped through the whole system, all the way to threat detection, profiling, and filtered re-recording.
> (d) The result set (in HDFS, ES, or Profiler) needs to remain "tagged" somehow with a batch identifier, both so it doesn't get mixed up with all the other data from that event time, and so it can be bulk-deleted if you made a mistake and asked for TBs of the wrong data.
>
> An interesting part of (c) is that we don't really want the "batch" to interfere with ongoing real-time processing. Ideally the mechanism would also deal with data analysts submitting multiple batch requests at the same time (although admittedly that could be handled with a queue).
>
> Is it sufficient to simply depend on the event timestamp to route stuff appropriately? That doesn't seem to meet (d). We could effectively "virtualize" the batch job by suffixing the Kafka topic names for the whole data flow related to a batch. Batch id "foley3256", being a bunch of bro messages, could enter the Bro Parser on topic bro_foley3256. To carry this through to enrichment, etc., maybe it is sufficient to record the sensorType as "bro_foley3256", or maybe it should be sensorType "bro" on Kafka topic "enrichment_foley3256". Such schemes could satisfy (d) above, also. Obviously there are a lot of possible variations on this theme. What do you think?
>
> --Matt
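To make the topic-suffixing and the batch tagging in (d) concrete, here's a rough sketch of what a replay producer could look like. To be clear, this is just an illustration under assumptions: the batchTopic() helper, the "batch.id" field, and the broker address are made up for the example and aren't existing Metron APIs.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BatchReplayProducer {

      // Hypothetical: derive a per-batch topic, e.g. "bro_foley3256", so a
      // replay never mixes with the live "bro" topic (the "virtualized" batch).
      static String batchTopic(String sensorType, String batchId) {
        return sensorType + "_" + batchId;
      }

      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        String batchId = "foley3256"; // illustrative batch identifier
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Tag every replayed message with the batch id so downstream results
          // can be filtered or bulk-deleted, per (d). A real implementation
          // would also throttle here so Kafka isn't fed several TB at once.
          String message = "{\"source.type\":\"bro\",\"batch.id\":\"" + batchId + "\"}";
          producer.send(new ProducerRecord<>(batchTopic("bro", batchId), message));
        }
      }
    }

As Matt notes, the suffixed topic doubles as the batch identifier, so (d) largely falls out of the routing for free.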
> On 3/2/17, 12:54 PM, "Justin Leet" <justinjl...@gmail.com> wrote:
>
> I'm just going to throw out a few questions that I don't have good answers to. Casey and Nick, given your familiarity with the systems involved, do you have any thoughts?
>
> - What's the smallest unit of work we can do to enable at least a useful subset of a fully featured batch process? Looking at it from another angle, which of the use cases (either that Nick listed, or that anyone else has) gives us the best value for our effort?
> - Can we also do things like limiting support for the interdependencies Casey mentioned? If we do approach it that way, how do we avoid setting ourselves up for issues parallelizing the more complicated cases? It sounds like we'll need to brainstorm some of the dependency stuff anyway.
> - Are there places right now (like the Elasticsearch JIRA) where we need or want to make changes to either fix, or improve, or enable some of the larger-picture work?
>
> Jon, any other thoughts? It sounded like you were waiting to see how things played out a bit, so if you have any insight, I'd love to hear it.
>
> Justin
>
> On Tue, Feb 28, 2017 at 11:08 AM, Justin Leet <justinjl...@gmail.com> wrote:
>
> > @Jon, it looks like it is based on system date.
> >
> > From ElasticsearchWriter.write:
> >
> >     String indexPostfix = dateFormat.format(new Date());
> >     ...
> >     indexName = indexName + "_index_" + indexPostfix;
> >     ...
> >     IndexRequestBuilder indexRequestBuilder = client.prepareIndex(indexName, sensorType + "_doc");
> >
> > Justin
> >
> > On Tue, Feb 28, 2017 at 10:44 AM, zeo...@gmail.com <zeo...@gmail.com> wrote:
> >
> >> I'm actually a bit surprised to see METRON-691, because I know a while back I did some experiments to ensure that data was being written to the indexes that relate to the timestamp in the message, not the current time, and I thought that messages were getting written to the proper historical indexes, not the current one. This was so long ago now, though, that it would require another look, and I only reviewed it operationally (put a message on a topic with a certain timestamp, search for it in Kibana).
> >>
> >> If that is not the case currently (which I should be able to verify later this week) then that would be pretty concerning and somewhat separate from the previous "Metron Batch" style discussions, which are more focused on bulk data load or historical analysis.
> >>
> >> I will wait to see how the rest of this conversation pans out before giving my thoughts on the bigger picture.
> >>
> >> Jon
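Since METRON-691 keeps coming up: here is a minimal sketch of the difference between the system-time indexing in the ElasticsearchWriter snippet quoted above and a hypothetical event-time variant. The index date pattern and the eventTimeIndex() helper are assumptions for illustration, not the actual writer code.

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class IndexNaming {

      // Assumed hourly index pattern; the real pattern is configurable in Metron.
      private static final SimpleDateFormat DATE_FORMAT =
          new SimpleDateFormat("yyyy.MM.dd.HH");

      // System-time version, as in the ElasticsearchWriter snippet above:
      // replayed historical data would land in *today's* index.
      static String systemTimeIndex(String indexName) {
        return indexName + "_index_" + DATE_FORMAT.format(new Date());
      }

      // Hypothetical event-time version: derive the postfix from the message's
      // own timestamp (epoch millis) so data lands in the historical index.
      static String eventTimeIndex(String indexName, long eventTimeMillis) {
        return indexName + "_index_" + DATE_FORMAT.format(new Date(eventTimeMillis));
      }

      public static void main(String[] args) {
        long twoYearsAgo = System.currentTimeMillis() - 2L * 365 * 24 * 60 * 60 * 1000;
        System.out.println(systemTimeIndex("bro"));             // e.g. bro_index_2017.03.02.17
        System.out.println(eventTimeIndex("bro", twoYearsAgo)); // e.g. bro_index_2015.03.02.17
      }
    }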
> >> On Tue, Feb 28, 2017 at 9:19 AM Casey Stella <ceste...@gmail.com> wrote:
> >>
> >> > I think this is a really tricky topic, but necessary. I've given it a bit of thought over the last few months and I don't really see a great way to do it, given the Profiler. Here's what I've come up with so far, though, in my thinking:
> >> >
> >> > - Replaying events will compress events in time (e.g. 2 years of data may come through in 10 minutes).
> >> > - Replaying events may result in events being out of order temporally even if they are written to Kafka in order (just by virtue of hitting a different Kafka partition).
> >> >
> >> > Given both of these, in my mind we should handle replaying of data *not* within a streaming context, so we can control the order and the grouping of the data. In my mind, this is essentially the advent of batch Metron. Off the top of my head, I'm having trouble thinking about how to parallelize this, however, in a pretty manner.
> >> >
> >> > Imagine a scenario where telemetry A has an enrichment E1 that depends on profile P1, and profile P1 depends on the previous 10 minutes of data. How in a batch or streaming context can we ever hope to ensure that the profiles for P1 for the last 10 minutes are in place as data flows through across all data points? Now how about if the values that P1 depends on are computed from a profile P2? Essentially you have a data dependency graph between enrichments, profiles, and raw data that you need to work in order.
> >> >
> >> > On Tue, Feb 28, 2017 at 8:03 AM, Justin Leet <justinjl...@gmail.com> wrote:
> >> >
> >> > > There are a couple of JIRAs related to the use of system time vs. event time:
> >> > >
> >> > > METRON-590 Enable Use of Event Time in Profiler <https://issues.apache.org/jira/browse/METRON-590>
> >> > > METRON-691 Elastic Writer index partitions on system time, not event time <https://issues.apache.org/jira/browse/METRON-691>
> >> > >
> >> > > Is there anything else that needs to be making this distinction, and if so, do we need to be able to support both system time and event time for it?
> >> > >
> >> > > My immediate thought on this is that, once we work on replaying historical data, we'll want system time for geo data passing through. Given that the geo files can update, we'd want to know which geo file we actually need to be using at the appropriate time.
> >> > >
> >> > > We'll probably also want to double-check anything else that writes out data to a location and provides some sort of timestamping on it.
> >> > >
> >> > > Justin
> >>
> >> --
> >> Jon
> >> Sent from my mobile device
--
Jon
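P.S. Casey's E1 -> P1 -> P2 example is essentially a topological-ordering problem over the enrichment/profile dependency graph. As a hedged sketch of how a batch driver might order that work per replay window (the graph model and names here are illustrative, not an existing Metron structure):

    import java.util.*;

    public class DependencyOrder {

      // Kahn's algorithm over "node -> list of nodes it depends on".
      static List<String> topologicalOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> unmet = new HashMap<>();           // unmet dependency counts
        Map<String, List<String>> dependents = new HashMap<>(); // reverse edges
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
          unmet.put(e.getKey(), e.getValue().size());
          for (String dep : e.getValue()) {
            unmet.putIfAbsent(dep, 0);
            dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
          }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : unmet.entrySet()) {
          if (e.getValue() == 0) { ready.add(e.getKey()); }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
          String node = ready.poll();
          order.add(node);
          for (String d : dependents.getOrDefault(node, Collections.emptyList())) {
            if (unmet.merge(d, -1, Integer::sum) == 0) { ready.add(d); }
          }
        }
        if (order.size() != unmet.size()) {
          throw new IllegalStateException("cycle in profile/enrichment dependencies");
        }
        return order;
      }

      public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("E1", Arrays.asList("P1"));      // enrichment E1 needs profile P1
        deps.put("P1", Arrays.asList("P2"));      // profile P1 needs profile P2
        deps.put("P2", Collections.emptyList());  // P2 needs only raw data
        System.out.println(topologicalOrder(deps)); // [P2, P1, E1]
      }
    }

For each replay time window, profiles would be materialized in this order before the enrichments that depend on them run; a cycle would indicate a misconfiguration.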