I think this is a really tricky topic, but a necessary one. I've given it a bit of thought over the last few months, and I don't really see a great way to do it given the Profiler. Here's what I've come up with so far, though.
- Replaying events will compress events in time (e.g. 2 years of data may come through in 10 minutes).
- Replaying events may result in events being out of order temporally, even if they are written to Kafka in order (just by virtue of hitting a different Kafka partition).

Given both of these, in my mind we should handle replaying of data *not* within a streaming context, so we can control the order and the grouping of the data. In my mind, this is essentially the advent of batch Metron. Off the top of my head, though, I'm having trouble thinking of a clean way to parallelize this.

Imagine a scenario where telemetry A has an enrichment E1 that depends on profile P1, and profile P1 depends on the previous 10 minutes of data. How, in a batch or streaming context, can we ever hope to ensure that the profiles for P1 for the last 10 minutes are in place as data flows through, across all data points? Now what if the values that P1 depends on are computed from a profile P2? Essentially you have a data dependency graph between enrichments, profiles, and raw data that you need to work through in order.

On Tue, Feb 28, 2017 at 8:03 AM, Justin Leet <justinjl...@gmail.com> wrote:

> There's a couple JIRAs related to the use of system time vs event time.
>
> METRON-590 Enable Use of Event Time in Profiler
> <https://issues.apache.org/jira/browse/METRON-590>
> METRON-691 Elastic Writer index partitions on system time, not event time
> <https://issues.apache.org/jira/browse/METRON-691>
>
> Is there anything else that needs to be making this distinction, and if so,
> do we need to be able to support both system time and event time for it?
>
> My immediate thought on this is that, once we work on replaying historical
> data, we'll want system time for geo data passing through. Given that the
> geo files can update, we'd want to know which geo file we actually need to
> be using at the appropriate time.
>
> We'll probably also want to double check anything else that writes out data
> to a location and provides some sort of timestamping on it.
>
> Justin
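
To make the dependency-graph point from the reply above a bit more concrete, here is a minimal Python sketch of what "working in order" could look like in a batch replay. Everything in it is an assumption for illustration: the node names (`raw_A`, `profile_P2`, `profile_P1`, `enrichment_E1`), the event shape (a dict with a `timestamp` in epoch seconds), and the functions `processing_order` and `replay_plan` are hypothetical and are not Metron APIs. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to order the stages, and groups events into event-time windows so that out-of-order arrival does not matter.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each node maps to the set of nodes it depends on.
dependencies = {
    "raw_A": set(),
    "profile_P2": {"raw_A"},
    "profile_P1": {"profile_P2"},
    "enrichment_E1": {"profile_P1", "raw_A"},
}


def processing_order(deps):
    """Return the stages in an order where every stage follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def replay_plan(events, window_seconds, deps):
    """Build an ordered list of (window_start, stage, events_in_window) steps.

    Events are bucketed by event time (not arrival time), windows are visited
    oldest first, and within each window the stages run in dependency order.
    """
    order = processing_order(deps)

    # Bucket events by the start of the event-time window they fall into.
    windows = {}
    for event in events:
        start = event["timestamp"] - event["timestamp"] % window_seconds
        windows.setdefault(start, []).append(event)

    plan = []
    for start in sorted(windows):
        # Restore temporal order inside the window, whatever the arrival order was.
        batch = sorted(windows[start], key=lambda e: e["timestamp"])
        for stage in order:
            plan.append((start, stage, batch))
    return plan


if __name__ == "__main__":
    # Events arrive out of order, as they might after a replay across partitions.
    replayed = [{"timestamp": t} for t in (650, 10, 30, 605)]
    for start, stage, batch in replay_plan(replayed, 600, dependencies):
        print(start, stage, [e["timestamp"] for e in batch])
```

Note that this sketch visits the windows strictly one after another, so it only illustrates the ordering constraint; it does not answer the parallelization question. Stages with no dependency between them could in principle run concurrently within a window, but a profile that looks back over the previous 10 minutes still forces the windows themselves to be processed in sequence.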