[DISCUSS] Batch Profiler Feature Branch

Nick Allen Wed, 19 Sep 2018 08:15:24 -0700

I would like to open a discussion to get the Batch Profiler feature branch
merged into master as part of METRON-1699 [1] Create Batch Profiler.  All
of the work that I had in mind for our first draft of the Batch Profiler
has been completed.  Please take a look through what I have and let me know
if there are other features that you think are required *before* we merge.

Previous list discussions on this topic include [2] and [3].

(Q) What can I do with the feature branch?

* With the Batch Profiler, you can backfill/seed profiles using archived
telemetry. This enables the following types of use cases.

1. As a Security Data Scientist, I want to understand the historical
behaviors and trends of a profile that I have created so that I can
determine if I have created a feature set that has predictive value for
model building.

2. As a Security Data Scientist, I want to understand the historical
behaviors and trends of a profile that I have created so that I can
determine if I have defined the profile correctly and created a feature set
that matches reality.

3. As a Security Platform Engineer, I want to generate a profile
using archived telemetry when I deploy a new model to production so that
models depending on that profile can function on day 1.

* METRON-1699 [1] includes a more detailed description of the feature.

(Q) What work was completed?

* The Batch Profiler runs on Spark and was implemented in Java to remain
consistent with our current Java-heavy code base.

* The Batch Profiler is executed from the command-line. It can be
launched using a script or, for advanced users, by calling `spark-submit`
directly.

* Input telemetry can be consumed from multiple sources, for example HDFS
or the local file system.

* Input telemetry can be consumed in multiple formats, for example JSON
or ORC.

* The 'output' profile measurements are persisted in HBase and are
consistent with those written by the Storm Profiler.

* It can be run on any underlying engine supported by Spark. I have
tested it both in 'local' mode and on a YARN cluster.

* It is installed automatically by the Metron MPack.

* A README was added that documents usage instructions.

* The existing Profiler code was refactored so that as much code as
possible is shared between the 3 Profiler ports: Storm, the Stellar REPL,
and Spark. For example, the logic which determines the timestamp of a
message was refactored so that it could be reused by all ports.

* metron-profiler-common: The common Profiler code shared amongst
each port.
* metron-profiler-storm: Profiler on Storm
* metron-profiler-spark: Profiler on Spark
* metron-profiler-repl: Profiler on the Stellar REPL
* metron-profiler-client: The client code for retrieving profile
data; for example PROFILE_GET.

* There are 3 separate RPM and DEB packages now created for the Profiler.

* metron-profiler-storm-*.rpm
* metron-profiler-spark-*.rpm
* metron-profiler-repl-*.rpm

* The Profiler integration tests were enhanced to leverage the Profiler
Client logic to validate the results.

* Review METRON-1699 [1] for a complete break-down of the tasks that have
been completed on the feature branch.

(Q) What limitations exist?

* You must manually install Spark to use the Batch Profiler. The Metron
MPack does not treat Spark as a Metron dependency and so does not install
it automatically.

* You do not configure the Batch Profiler in Ambari. It is configured
and executed completely from the command-line.

* To run the Batch Profiler in 'Full Dev', you have to take the following
manual steps. Some of these are arguably limitations with how Ambari
installs Spark 2 in the version of HDP that we run.

1. Install Spark 2 using Ambari.

2. Tell Spark how to talk with HBase.

SPARK_HOME=/usr/hdp/current/spark2-client
cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/

3. Create the Spark History directory in HDFS.

export HADOOP_USER_NAME=hdfs
hdfs dfs -mkdir /spark2-history

4. Change the default input path to `hdfs://localhost:8020/...` to
match the port defined by HDP, instead of port 9000.
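
For anyone who wants to script steps 2 and 3 above, here is a minimal
sketch. The function names `copy_hbase_site` and `create_spark_history_dir`
are hypothetical and are not part of the feature branch; the default paths
are the ones listed in the steps, and the `-p` flag on `hdfs dfs -mkdir` is
my addition so the call can be repeated safely. Steps 1 and 4 are not
covered.

```shell
#!/usr/bin/env bash
# Sketch of manual steps 2 and 3 for Full Dev.
# Function names are hypothetical; default paths come from the steps above.

# Step 2: copy hbase-site.xml into Spark's conf directory so that Spark
# can talk with HBase. Both arguments are optional.
copy_hbase_site() {
  local hbase_conf="${1:-/usr/hdp/current/hbase-client/conf}"
  local spark_conf="${2:-/usr/hdp/current/spark2-client/conf}"
  if [ ! -f "$hbase_conf/hbase-site.xml" ]; then
    echo "hbase-site.xml not found in $hbase_conf" >&2
    return 1
  fi
  cp "$hbase_conf/hbase-site.xml" "$spark_conf/"
}

# Step 3: create the Spark History directory in HDFS as the hdfs user.
# -p avoids an error if the directory already exists.
create_spark_history_dir() {
  local dir="${1:-/spark2-history}"
  HADOOP_USER_NAME=hdfs hdfs dfs -mkdir -p "$dir"
}
```

After sourcing the file, calling `copy_hbase_site` and then
`create_spark_history_dir` with no arguments should reproduce the two
manual steps using the paths shown above.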

[1] https://issues.apache.org/jira/browse/METRON-1699
[2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
[3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
