Github user markgrover commented on a diff in the pull request:
https://github.com/apache/incubator-spot/pull/7#discussion_r95460852
--- Diff: docs/Open Data Model/Open Data Model.md ---
@@ -0,0 +1,755 @@
+Overview
+
+Apache Spot Open Data Model Strategy
+
+Apache Spot Enabled Use Cases
+
+Data Model
+
+Naming Convention
+
+Prefixes
+
+Security Event Log/Alert Data Model
+
+Common
+
+Network
+
+File
+
+Endpoint
+
+User
+
+DNS
+
+Proxy
+
+HTTP
+
+SMTP
+
+FTP
+
+SNMP
+
+TLS
+
+SSH
+
+DHCP
+
+IRC
+
+Flow
+
+Context Models
+
+User Context Model
+
+Endpoint Context Model
+
+Network Context Model
+
+Extensibility of Data Model
+
+Model Relationships
+
+Data Ingestion Framework
+
+Data Formats
+
+Avro
+
+JSON
+
+Parquet
+
+ODM Resultant Capability - A Singular View
+
+**Example - Advanced Threat Modeling**
+
+**Example - Singular Data View for Complete Context**
+
+
+
+**Overview**
+----
+
+This document describes a strategy for creating an open data model (ODM)
for Apache Spot (incubating) (formerly known as "Open Network Insight
(ONI)") in support of cyber security analytic use cases. It also describes
the use cases that Apache Spot (incubating) running on the Cloudera
platform is uniquely capable of addressing along with the data model.
+
+
+
+**Apache Spot (incubating) Open Data Model Strategy**
+------------------------------------
+
+The Apache Spot (incubating) Open Data Model (ODM) strategy aims to extend
Apache Spot (incubating) capabilities to support a broader set of cyber
security use cases than initially supported. The primary use case initially
supported by Apache Spot (incubating) is Network Traffic Analysis for
network flows (NetFlow, sFlow, etc.), DNS and Proxy: primarily the
identification of threats through anomalous event detection using both
supervised and unsupervised machine learning.
+
+In order to support a broader set of use cases, Spot must be extended to
collect and analyze other common
+"event-oriented" data sources analyzed for cyber threats, including
but not limited to the following log types:
+
+> - Proxy
+>
+> - Web server
+>
+> - Operating system
+>
+> - Firewall
+>
+> - Intrusion Prevention/Detection (IPS/IDS)
+>
+> - Data Loss Prevention
+>
+> - Active Directory / Identity Management
+>
+> - User/Entity Behavior Analysis
+>
+> - Endpoint Protection/Asset Management
+>
+> - Network Metadata/Session and PCAP files
+>
+> - Network Access Control
+>
+> - Mail
+>
+> - VPN
+>
+> - etc.
+
+One of the biggest challenges organizations face today in combating cyber
threats is collecting and normalizing data from the myriad (often hundreds) of
security event data sources in order to build the needed analytics. This often
results in the analytics being dependent upon the specific technologies used by
an organization to detect threats, and it prevents the flexibility and
agility needed to keep up with these ever-increasing (and complex) threats.
Technology lock-in is sometimes a byproduct of today's status quo, as it is
extremely costly to add new technologies (or replace existing ones) because of
the downstream analytic dependencies.
+
+To achieve the goal of extending Apache Spot (incubating) to support
additional use cases, it is necessary to create an open data model for the most
relevant security event and contextual data sources: security event logs or
alerts, network context, user details, and information that comes from the
endpoints or from any other console used to manage the security or
administration of those endpoints. The presence of an open data model, which can
be applied "on-read" or "on-write", in batch or stream, will allow for
the separation of security analytics from the specific data sources on which
they are built. This "separation of duties" will enable organizations to
build analytics that are not dependent upon specific technologies and provide
the flexibility to change underlying data sources, and to segment
this information, without impacting the analytics. This will also afford
security vendors the opportunity to build additional products on top of
the Open Data Model to drive new revenue streams and to design new ways to
detect threats and APTs.
+
+
+**Apache Spot (incubating) Enabled Use Cases**
+----
+
+Spot on the Cloudera platform is uniquely positioned to help address the
following cyber security use cases,
+which are not effectively addressed by legacy technologies:
+
+
+
+ **- Detection of known & unknown threats leveraging machine learning and
advanced analytic modeling**
+
+Current technologies are limited in the analytics they can apply to detect
threats. These limitations stem from the inability to collect all the data
sources needed to effectively identify threats (structured, unstructured, etc.)
and the inability to process the massive volumes of data needed to do so (billions
of events per day). Legacy technologies typically focus on, and are limited to,
rules-based and signature detection. They are somewhat "effective" at
detecting known threats but struggle with new threats.
+
+Spot addresses these gaps through its ability to collect any data type of
any volume. Coupled with the various analytic frameworks that are provided
(including machine learning), Spot enables a whole new class of analytics that
can scale to today's demands. The topic model used by Spot to detect
anomalous network traffic is one example of where the Spot platform excels.
+
+ **- Reduction of mean time to incident detection & resolution (MTTR)**
+
+One of the challenges organizations face today is detecting threats early
enough to minimize adverse impacts. This stems from the limitations previously
discussed with regard to limited analytics. It can also be attributed to the
fact that most investigative queries take hours or days to return
results. Legacy technologies cannot offer a central data store for
facilitating such investigations due to their inability to store and serve the
massive amounts of data involved. This cripples incident investigations and
results in MTTRs of many weeks or months. Meanwhile, the adverse impacts of the
breach are magnified, making the threat harder to eradicate.
+
+Apache Spot (incubating) addresses these gaps by providing the capability
for a central data store that houses ALL the data needed to facilitate an
investigation, returning investigative query results in seconds and minutes
(vs. hours and days). Spot can effectively reduce incident MTTR and reduce
adverse impacts of a breach.
+
+ **- Threat Hunting**
+
+It has become necessary for organizations to "hunt" for active threats,
as traditional passive threat detection approaches are not sufficient.
"Hunting" involves performing ad-hoc searches and queries over vast amounts
of data representing many weeks' or months' worth of events, as well as
applying ad-hoc or tuned algorithms to detect the needle in the haystack.
Traditional systems do not perform well for these types of activities, as the
query results sometimes take hours or days to be retrieved. These traditional
systems also lack the analytic flexibility to construct the necessary
algorithms and logic.
+
+Apache Spot (incubating) addresses these gaps in the same way it
addresses the others: by providing a central data store with the needed analytic
frameworks that scale to the needed workloads.
+
+**Data Model**
+----------
+In order to provide a framework for effectively analyzing data for cyber
threats, it is necessary to collect and
+analyze standard security event logs/alerts and contextual data regarding
the entities referenced in these logs/alerts. The most common entities include
network, user and endpoint, but there are others such as file.
+
+In the diagram below, the raw event tells us that user "jsmith"
successfully logged in to an Oracle database from the IP address 10.1.1.3.
Based on the raw event only, we don't know whether this event represents a
genuine threat. After injecting user and endpoint context, the enriched event
tells us this event is a potential threat that requires further investigation.
+
+
+
+Based on the need to collect and analyze both security events, logs or
alerts and contextual data, support for
+the following types of security information is planned for inclusion in
the Spot Open Data Model:
+
+ - Security event logs/alerts
+This data type includes event logs from common data sources used to detect
threats and includes network flows, operating system logs, IPS/IDS logs,
firewall logs, proxy logs, web logs, DLP logs, etc.
+
+ - Network context data
+This data type includes information about the network, which can be
gleaned from Whois servers, asset databases and other similar data sources.
+
+ - User context data
+This data type includes information from user and identity management
systems including Active Directory, Centrify, and other identity and access
management systems.
+
+ - Endpoint context data
+This data includes information about endpoint systems (servers,
workstations, routers, switches, etc.) and can be sourced from asset management
systems, vulnerability scanners, and endpoint management/detection/response
systems such as Webroot, Tanium, Sophos, Endgame, CarbonBlack, Intel Security
ePO and others.
+
+ - File context data **(ROADMAP ITEM)**
+This data includes contextual information about files and can be sourced
from systems such as FireEye, Application Control and others.
+
+ - Threat intelligence context data **(ROADMAP ITEM)**
+This data includes contextual information about URLs, domains, websites,
files and others.
+
+**Naming Convention**
+-----------------
+
+A naming convention is needed for the Open Data Model to represent common
attributes across vendor products and technologies. The naming convention is
described below.
+
+**Prefixes**
+--------
+
+| Prefix | Description |
+|---|---|
+| src | Corresponds to the "source" fields within a given event (e.g. source address) |
+| dst | Corresponds to the "destination" fields within a given event (e.g. destination address) |
+| dvc | Corresponds to the "device" applicable fields within a given event (e.g. device address) and represents where the event originated |
+| fwd | Forwarded from device |
+| request | Corresponds to requested values (vs. those returned, e.g. "requested URI") |
+| response | Corresponds to response values (vs. those requested) |
+| file | Corresponds to the "file" fields within a given event (e.g. file type) |
+| user | Corresponds to user attributes (e.g. name, id, etc.) |
+| xlate | Corresponds to translated values within a given event (e.g. src_xlate_ip for "translated source IP address") |
+| in | Ingress |
+| out | Egress |
+| new | New value |
+| orig | Original value |
+| app | Corresponds to values associated with application events |
+
+
+**Security Event Log/Alert Data Model**
+-----------------------------------
+
+The data model for security event logs/alerts is detailed below.
The attributes are categorized as follows:
+
+ - Common - attributes that are common across many device types
+ - Device - attributes that are applicable to the device that generated the
event
+ - File - attributes that are applicable to file objects referenced in the
event
+ - Endpoint - attributes that are applicable to the endpoints referenced in
the event
+ - User - attributes that are applicable to the user referenced in the event
+ - Proxy - attributes that are applicable to proxy events
+ - Protocol
+
+> DNS - attributes that are specific to DNS events
+> HTTP - attributes that are specific to HTTP events
+> SMTP, SSH, TLS, DHCP, IRC, SNMP and FTP
+
+Note: The model will evolve to include reserved attributes for additional
device types that are not currently represented. The model can currently be
extended to support ANY attribute for ANY device type by following the guidance
outlined in the section titled **"Extensibility of Data Model".**
+
+Note: Attributes denoted in BLUE represent those that are listed in the
model multiple times for the purpose of
--- End diff --
I don't know if markdown can do color. If not, we should change BLUE to
bold. And, change all the blue items in the original doc to be bolded.
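
Separately, on the Prefixes table in the diff above: a minimal Python sketch of how those prefixes might combine into field names could help readers. It rests on assumptions the doc does not confirm. The underscore delimiter and the "prefixes first, attribute last" ordering are inferred only from the single `src_xlate_ip` example in the table, and `split_field_name` is a hypothetical helper, not part of Spot.

```python
# Prefixes copied from the "Prefixes" table in the quoted diff.
KNOWN_PREFIXES = {
    "src", "dst", "dvc", "fwd", "request", "response", "file",
    "user", "xlate", "in", "out", "new", "orig", "app",
}


def split_field_name(name):
    """Split a field name into (list_of_prefixes, attribute).

    Assumption: names are underscore-delimited with prefixes first,
    as suggested by the table's one example, src_xlate_ip.
    """
    tokens = name.split("_")
    prefixes = []
    # Consume leading known prefixes, but always leave at least one
    # token behind to serve as the attribute name.
    while len(tokens) > 1 and tokens[0] in KNOWN_PREFIXES:
        prefixes.append(tokens.pop(0))
    return prefixes, "_".join(tokens)


if __name__ == "__main__":
    # src_xlate_ip comes from the table; the other names are illustrative.
    for field in ("src_xlate_ip", "dst_ip", "hostname"):
        print(field, "->", split_field_name(field))
```

If the real convention differs (for example a different delimiter or a fixed prefix order), only the tokenizing line and the loop condition would need to change.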
---