[ https://issues.apache.org/jira/browse/METRON-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Allen updated METRON-1699:
-------------------------------
    Fix Version/s: Next + 1

> Create Batch Profiler
> ---------------------
>
>                 Key: METRON-1699
>                 URL: https://issues.apache.org/jira/browse/METRON-1699
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Nick Allen
>            Assignee: Nick Allen
>            Priority: Major
>             Fix For: Next + 1
>
>         Attachments: Screen Shot 2018-07-27 at 10.55.27 AM.png, Screen Shot 2018-07-27 at 11.07.33 AM.png, Screen Shot 2018-07-27 at 11.10.16 AM.png
>
>
> Create a Batch Profiler that satisfies the following use cases.
> h3. Use Cases
>  * As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have created a feature set that has predictive value for model building.
>  * As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have defined the profile correctly and created a feature set that matches reality.
>  * As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
> h3. Goal
>  * Currently, a profile can only be generated from the telemetry consumed *after* the profile was created.
>  * The goal is to enable “profile seeding”, which allows profiles to be populated from a time *before* the profile was created.
>  * A profile would be seeded using the telemetry that has been archived by Metron in HDFS.
>  * A profile consumer should not be able to distinguish the “seeded” portion of a profile.
> !Screen Shot 2018-07-27 at 10.55.27 AM.png!
> h3. Current State
>  * There are currently two ports of the Profiler: the Streaming Profiler, which handles streaming data in Storm, and another that runs in the REPL and allows a user to manually build, test, and debug profiles.
>  * These ports largely share a common code base in metron-analytics/metron-profiler-common.
>  * A smaller set of “orchestration” logic is required to maintain each port: one for Storm, another for the REPL.
>  * Both Profiler ports support both system time and event time processing.
> !Screen Shot 2018-07-27 at 11.07.33 AM.png!
> h3. Approach
>  * Create a third port of the Profiler: the Batch Profiler.
>  * The Batch Profiler will be built to run in Spark so that the telemetry can be consumed in batch.
>  * Allows a user to seed profiles using the JSON telemetry that is archived in HDFS by Metron Indexing.
>  * Only generates the profile data stored in HBase, not the messages that are produced for Threat Triage and Kafka.
>  * Any number of profiles can be generated at once, but no dependencies between the profiles are supported. A dependency is where one profile is a consumer of the profile generated by another.
>  * The Batch Profiler must use the timestamps contained within the telemetry; it runs on event time. Luckily, the Profiler already supports event time.
>  * Enable a pluggable mechanism so that telemetry stored in different formats can be consumed by the Batch Profiler. For example, the Profiler should be able to consume telemetry stored as raw JSON or in other formats like ORC or Parquet.
> !Screen Shot 2018-07-27 at 11.10.16 AM.png!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
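Two ideas from the Approach section of the ticket, event-time processing and a pluggable telemetry reader, can be illustrated with a minimal sketch. This is plain Python rather than Spark, and it is not Metron's API: the `READERS` registry, the `seed_profile` function, the `timestamp` and `ip_src_addr` field names, and the 15-minute period are all assumptions made for illustration. It only counts messages per entity and period; a real profile would apply its own definition.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical reader type: turns raw archived records into message dicts.
Reader = Callable[[Iterable[str]], Iterator[dict]]


def read_json_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON message per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical registry; readers for other formats (e.g. ORC or Parquet)
# would be registered here under their own keys.
READERS: Dict[str, Reader] = {"json": read_json_lines}


def seed_profile(
    lines: Iterable[str],
    fmt: str = "json",
    period_ms: int = 15 * 60 * 1000,    # assumed period length
    entity_field: str = "ip_src_addr",  # assumed entity field name
) -> Dict[Tuple[str, int], int]:
    """Count messages per (entity, period) using event time.

    The period is derived from the timestamp carried in each message
    (assumed to be epoch milliseconds), never from the wall clock, so
    archived telemetry lands in the periods where it originally occurred.
    """
    reader = READERS[fmt]
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for message in reader(lines):
        period = message["timestamp"] // period_ms
        counts[(message[entity_field], period)] += 1
    return dict(counts)


archived = [
    '{"ip_src_addr": "10.0.0.1", "timestamp": 0}',
    '{"ip_src_addr": "10.0.0.1", "timestamp": 899999}',
    '{"ip_src_addr": "10.0.0.1", "timestamp": 900000}',
    '{"ip_src_addr": "10.0.0.2", "timestamp": 1000}',
]
seeded = seed_profile(archived)
```

Because the period key depends only on the message timestamp, replaying the same archive at any later date produces the same buckets, which is what would let a seeded portion of a profile be indistinguishable from the streamed portion.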