I think this is a really tricky topic, but a necessary one.  I've given it a
bit of thought over the last few months and I don't see a great way to do it
given the Profiler.  Here's what I've come up with so far:


   - Replaying events will compress events in time (e.g. 2 years of data
   may come through in 10 minutes)
   - Replaying events may result in events arriving out of order temporally,
   even if they were written to Kafka in order (just by virtue of hitting
   different Kafka partitions)

Given both of these, I think we should handle replaying data *not* within a
streaming context, so that we can control the order and the grouping of the
data.  In my mind, this is essentially the advent of batch Metron.  Off the
top of my head, though, I'm having trouble seeing how to parallelize this
cleanly.
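
To make the "control the order and the grouping" point a bit more concrete,
here's a rough sketch of what I mean by regrouping replayed data on event
time rather than arrival time.  The TelemetryEvent class and its
eventTimeMillis field are just placeholders for illustration, not anything
that exists in Metron today:

    import java.util.*;
    import java.util.stream.*;

    // Rough sketch only -- class and field names are placeholders, not Metron APIs.
    public class EventTimeBucketing {

        // A replayed event carrying the timestamp of when it *originally* happened.
        static class TelemetryEvent {
            final long eventTimeMillis;
            final String payload;
            TelemetryEvent(long eventTimeMillis, String payload) {
                this.eventTimeMillis = eventTimeMillis;
                this.payload = payload;
            }
        }

        // Bucket events into fixed event-time windows (e.g. 10-minute profile
        // periods), ignoring the order and speed at which they came back off Kafka.
        static SortedMap<Long, List<TelemetryEvent>> bucketByEventTime(
                List<TelemetryEvent> replayed, long windowMillis) {
            return replayed.stream().collect(Collectors.groupingBy(
                    e -> e.eventTimeMillis / windowMillis,  // window index
                    TreeMap::new,                           // keeps windows in temporal order
                    Collectors.toList()));
        }

        public static void main(String[] args) {
            long tenMinutes = 10L * 60 * 1000;
            List<TelemetryEvent> replayed = Arrays.asList(
                    new TelemetryEvent(30L * 60 * 1000, "arrived first, happened later"),
                    new TelemetryEvent(5L * 60 * 1000, "arrived later, happened first"));
            // Windows come back in event-time order regardless of arrival order.
            bucketByEventTime(replayed, tenMinutes)
                    .forEach((window, events) -> System.out.println(window + " -> " + events.size()));
        }
    }

Once the full replay set is bucketed like this, the windows can be worked
strictly in temporal order, which is exactly what we can't guarantee in the
streaming path.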

Imagine a scenario where telemetry A has an enrichment E1 that depends on
profile P1, and profile P1 depends on the previous 10 minutes of data.  How,
in a batch or streaming context, can we ever hope to ensure that the P1
profile values for the last 10 minutes are in place for every data point as
it flows through?  Now, what if the values that P1 depends on are themselves
computed from another profile P2?  Essentially you have a data dependency
graph between enrichments, profiles, and raw data that you need to work
through in order.
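
Just to illustrate the ordering problem (not a real implementation), a plain
topological sort over that dependency graph gives the order in which a batch
replay would have to materialize things; E1, P1, P2, and "telemetry A" below
are just the names from the example above:

    import java.util.*;

    // Sketch of the dependency ordering problem described above:
    // E1 depends on P1, P1 depends on P2, and every profile depends on raw data.
    public class DependencyOrder {

        // Adjacency list: node -> the nodes it depends on.
        static List<String> topologicalOrder(Map<String, List<String>> dependsOn) {
            List<String> ordered = new ArrayList<>();
            Set<String> visited = new HashSet<>();
            for (String node : dependsOn.keySet()) {
                visit(node, dependsOn, visited, ordered);
            }
            return ordered;  // dependencies always appear before their dependents
        }

        static void visit(String node, Map<String, List<String>> dependsOn,
                          Set<String> visited, List<String> ordered) {
            if (!visited.add(node)) {
                return;  // already handled (cycle detection omitted in this sketch)
            }
            for (String dep : dependsOn.getOrDefault(node, Collections.emptyList())) {
                visit(dep, dependsOn, visited, ordered);
            }
            ordered.add(node);
        }

        public static void main(String[] args) {
            Map<String, List<String>> dependsOn = new HashMap<>();
            dependsOn.put("E1", Arrays.asList("P1"));
            dependsOn.put("P1", Arrays.asList("P2", "telemetry A"));
            dependsOn.put("P2", Arrays.asList("telemetry A"));
            // Prints something like: [telemetry A, P2, P1, E1]
            System.out.println(topologicalOrder(dependsOn));
        }
    }

For each event-time window you'd have to work the graph bottom-up: raw
telemetry first, then P2, then P1, and only then can E1 be applied.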



On Tue, Feb 28, 2017 at 8:03 AM, Justin Leet <justinjl...@gmail.com> wrote:

> There are a couple of JIRAs related to the use of system time vs. event time.
>
> METRON-590 Enable Use of Event Time in Profiler
> <https://issues.apache.org/jira/browse/METRON-590>
> METRON-691 Elastic Writer index partitions on system time, not event time
> <https://issues.apache.org/jira/browse/METRON-691>
>
> Is there anything else that needs to be making this distinction, and if so,
> do we need to be able to support both system time and event time for it?
>
> My immediate thought on this is that, once we work on replaying historical
> data, we'll want system time for geo data passing through.  Given that the
> geo files can update, we'd want to know which geo file we actually need to
> be using at the appropriate time.
>
> We'll probably also want to double check anything else that writes out data
> to a location and provides some sort of timestamping on it.
>
> Justin
>
