[GitHub] metron pull request #961: METRON-1487 Define Performance Benchmarks for Enri...

JonZeolla Wed, 14 Mar 2018 18:17:02 -0700

Github user JonZeolla commented on a diff in the pull request:

    https://github.com/apache/metron/pull/961#discussion_r174652892
  
    --- Diff: metron-platform/metron-enrichment/Performance.md ---
    @@ -0,0 +1,527 @@
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing, software
    +distributed under the License is distributed on an "AS IS" BASIS,
    +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +See the License for the specific language governing permissions and
    +limitations under the License.
    +-->
    +
    +# Enrichment Performance
    +
    +This guide defines a set of benchmarks used to measure the performance of 
the Enrichment topology.  The guide also provides detailed steps on how to 
execute those benchmarks along with advice for tuning the Unified Enrichment 
topology.
    +
    +* [Benchmarks](#benchmarks)
    +* [Benchmark Execution](#benchmark-execution)
    +* [Performance Tuning](#performance-tuning)
    +* [Benchmark Results](#benchmark-results)
    +
    +## Benchmarks
    +
    +The following section describes a set of enrichments that will be used to 
benchmark the performance of the Enrichment topology.
    +
    +* [Geo IP Enrichment](#geo-ip-enrichment)
    +* [HBase Enrichment](#hbase-enrichment)
    +* [Stellar Enrichment](#stellar-enrichment)
    +
    +### Geo IP Enrichment
    +
    +This benchmark measures the performance of executing a Geo IP enrichment.  
Given a valid IP address, the enrichment will append detailed location 
information for that IP.  The location information is sourced from an external 
Geo IP data source like [MaxMind](https://github.com/maxmind/GeoIP2-java).
    +
    +#### Configuration
    +
    +Adding the following Stellar expression to the Enrichment topology 
configuration will define a Geo IP enrichment.
    +```
    +geo := GEO_GET(ip_dst_addr)
    +```
    +
    +After the enrichment process completes, the telemetry message will 
contain a set of fields with location information for the given IP address.
    +```
    +{
    +   "ip_dst_addr":"151.101.129.140",
    +   ...
    +   "geo.city":"San Francisco",
    +   "geo.country":"US",
    +   "geo.dmaCode":"807",
    +   "geo.latitude":"37.7697",
    +   "geo.location_point":"37.7697,-122.3933",
    +   "geo.locID":"5391959",
    +   "geo.longitude":"-122.3933",
    +   "geo.postalCode":"94107"
    +}
    +```
    +
    +### HBase Enrichment
    +
    +This benchmark measures the performance of executing an enrichment that 
retrieves data from an external HBase table. This type of enrichment is useful 
for enriching telemetry from an Asset Database or other source of relatively 
static data.
    +
    +#### Configuration
    +
    +Adding the following Stellar expression to the Enrichment topology 
configuration will define an HBase enrichment.  This looks up the 'ip_dst_addr' 
within an HBase table 'top-1m' and returns a hostname.
    +```
    +top1m := ENRICHMENT_GET('top-1m', ip_dst_addr, 'top-1m', 't')
    +```
    +
    +After the telemetry has been enriched, it will contain the host and IP 
elements that were retrieved from the HBase table.
    +```
    +{
    +   "ip_dst_addr":"151.101.2.166",
    +   ...
    +   "top1m.host":"earther.com",
    +   "top1m.ip":"151.101.2.166"
    +}
    +```
    +
    +### Stellar Enrichment
    +
    +This benchmark measures the performance of executing a basic Stellar 
expression.  In this benchmark, the enrichment is purely a computational task 
that has no dependence on an external system like a database.  
    +
    +#### Configuration
    +
    +Adding the following Stellar expression to the Enrichment topology 
configuration will define a basic Stellar enrichment.  The following returns 
true if the IP is in the given subnet and false otherwise.
    +```
    +local := IN_SUBNET(ip_dst_addr, '192.168.0.0/24')
    +```
    +
    +After the telemetry has been enriched, it will contain a field with a 
boolean value indicating whether the IP was within the given subnet.
    +```
    +{
    +   "ip_dst_addr":"151.101.2.166",
    +   ...
    +   "local":false
    +}
    +```
    +
    +## Benchmark Execution
    +
    +This section describes the steps necessary to execute the performance 
benchmarks for the Enrichment topology.
    +
    +* [Prepare Enrichment Data](#prepare-enrichment-data)
    +* [Load HBase with Enrichment Data](#load-hbase-with-enrichment-data)
    +* [Configure the Enrichments](#configure-the-enrichments)
    +* [Create Input Telemetry](#create-input-telemetry)
    +* [Cluster Setup](#cluster-setup)
    +* [Monitoring](#monitoring)
    +
    +### Prepare Enrichment Data
    +
    +The Alexa Top 1 Million was used as a data source for these benchmarks.
    +
    +1. Download the [Alexa Top 1 
Million](http://s3.amazonaws.com/alexa-static/top-1m.csv.zip).
    +
    +2. For each hostname, query DNS to retrieve an associated IP address.  
    +
    +   A script like the following can be used for this.  There is no need to 
do this for all 1 million entries in the data set. Doing this for around 10,000 
records is sufficient.
    +
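    +   ```python
    +   import csv
    +
    +   import dns.exception
    +   import dns.resolver
    +
    +   # Resolve against public DNS servers rather than the local default.
    +   resolver = dns.resolver.Resolver()
    +   resolver.nameservers = ['8.8.8.8', '8.8.4.4']
    +
    +   with open('top-1m.csv', 'r', newline='') as infile:
    +     with open('top-1m-with-ip.csv', 'w', newline='') as outfile:
    +
    +       reader = csv.reader(infile, delimiter=',')
    +       writer = csv.writer(outfile, delimiter=',')
    +       for row in reader:
    +
    +         # The hostname is the second column of each input row.
    +         host = row[1]
    +         try:
    +           # resolve() requires dnspython 2.x; older releases use query().
    +           response = resolver.resolve(host, "A")
    +           for record in response:
    +             ip = record.to_text()
    +             writer.writerow([host, ip])
    +             print("host={}, ip={}".format(host, ip))
    +
    +         except dns.exception.DNSException:
    +           # Skip hosts that fail to resolve.
    +           pass
    +   ```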
    +
    +3. The resulting data set contains an IP to hostname mapping.
    +   ```bash
    +   $ head top-1m-with-ip.csv
    +   google.com,172.217.9.46
    +   youtube.com,172.217.4.78
    +   facebook.com,157.240.18.35
    +   baidu.com,220.181.57.216
    +   baidu.com,111.13.101.208
    +   baidu.com,123.125.114.144
    +   wikipedia.org,208.80.154.224
    +   yahoo.com,98.139.180.180
    +   yahoo.com,206.190.39.42
    +   reddit.com,151.101.1.140
    +   ```
    +
    +### Load HBase with Enrichment Data
    +
    +1. Create an HBase table for this data.  
    +
    +   Ensure that the table is evenly distributed across the HBase nodes.  
This can be done by pre-splitting the table or splitting the data after loading 
it.  
    +
    +   ```
    +   create 'top-1m', 't', {SPLITS => ['2','4','6','8','a','c','e']}
    +   ```
    +
    +1. Create a configuration file called `extractor.json`.  This defines how 
the data will be loaded into the HBase table.
    +
    +   ```bash
    +   > cat extractor.json
    +   {
    +       "config": {
    +           "columns": {
    +               "host" : 0,
    +               "ip": 1
    +           },
    +           "indicator_column": "ip",
    +           "type": "top-1m",
    +           "separator": ","
    +       },
    +       "extractor": "CSV"
    +   }
    +   ```
    +
    +1. Use the `flatfile_loader.sh` to load the data into the HBase table.
    +   ```
    +   $METRON_HOME/bin/flatfile_loader.sh \
    +           -e extractor.json \
    +           -t top-1m \
    +           -c t \
    +           -i top-1m-with-ip.csv
    +   ```
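To make the `extractor.json` settings above more concrete, here is a rough Python sketch of how such a configuration could map one CSV line to a lookup key and value. This is an illustration only, not the loader's actual implementation; the `extract` function and its return shape are assumptions.

```python
import csv
import io

# Mirrors the "config" section of extractor.json.
config = {
    "columns": {"host": 0, "ip": 1},
    "indicator_column": "ip",
    "type": "top-1m",
    "separator": ",",
}


def extract(line, config):
    """Map one CSV line to (type, indicator, value) per the config (illustrative)."""
    row = next(csv.reader(io.StringIO(line), delimiter=config["separator"]))
    # Name each column according to the "columns" mapping.
    value = {name: row[idx] for name, idx in config["columns"].items()}
    # The indicator column supplies the key that enrichments look up.
    return config["type"], value[config["indicator_column"]], value


print(extract("google.com,172.217.9.46", config))
```

Under this reading, the IP is the lookup key and the named columns become the fields returned by the enrichment, consistent with the `top1m.host` and `top1m.ip` fields shown earlier.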
    +
    +### Configure the Enrichments
    +
    +1. Define the Enrichments using the REPL.
    +
    +   ```
    +   > $METRON_HOME/bin/stellar -z $ZOOKEEPER
    +   Stellar, Go!
    +
    +   [Stellar]>>> conf
    +   {
    +     "enrichment": {
    +       "fieldMap": {
    +        "stellar" : {
    +          "config" : {
    +            "geo" : "GEO_GET(ip_dst_addr)",
    +            "top1m" : "ENRICHMENT_GET('top-1m', ip_dst_addr, 'top-1m', 
't')",
    +            "local" : "IN_SUBNET(ip_dst_addr, '192.168.0.0/24')"
    +          }
    +        }
    +       },
    +       "fieldToTypeMap": {
    +       }
    +     },
    +     "threatIntel": {
    +     }
    +   }
    +   [Stellar]>>> CONFIG_PUT("ENRICHMENT", conf, "asa")
    +   ```
    +
    +### Create Input Telemetry
    +
    +1.  Create a template file that defines what your input telemetry will 
look like.
    +
    +   ```bash
    +   > cat asa.template
    +   {"ciscotag": "ASA-1-123123", "source.type": "asa", "ip_dst_addr": 
"$DST_ADDR", "original_string": "<134>Feb 22 17:04:43 AHOSTNAME %ASA-1-123123: 
Built inbound ICMP connection for faddr 192.168.11.8/50244 gaddr 
192.168.1.236/0 laddr 192.168.1.1/161", "ip_src_addr": "192.168.1.35", 
"syslog_facility": "local1", "action": "built", "syslog_host": "AHOSTNAME", 
"timestamp": "$METRON_TS", "protocol": "icmp", "guid": "$METRON_GUID", 
"syslog_severity": "info"}
    +   ```
    +
    +2.  Use the template file along with the enrichment data to create input 
telemetry with varying IP addresses.
    +
    +   ```bash
    +   for i in $(head top-1m-with-ip.csv | awk -F, '{print $2}');do
    +           cat asa.template | sed "s/\$DST_ADDR/$i/";
    +   done > asa.input.template
    +   ```
    +
    +3. Use the `load_test.sh` script to push messages onto the input topic 
`enrichments` and monitor the output topic `indexing`.
    +
    +   If the topology is keeping up, the events per second produced on the 
input topic should roughly match the events per second on the output topic.
    +
    +   ```
    +   $METRON_HOME/bin/load_test.sh \
    +           -e 200000 \
    +           -ot enrichments \
    +           -mt indexing \
    +           -p 10 \
    +           -t asa.input.template \
    +           -z $ZOOKEEPER
    +   ```
    +
    +   [TODO] Link to the docs that get created for the `load_test.sh` script.
    +
    +### Cluster Setup
    +
    +#### Isolation
    +
    +The Enrichment topology depends on an environment with at least two and 
often three components that work together: Storm, Kafka, and HBase.  When any 
two of these run on the same node, it can be difficult to identify which 
of them is becoming a bottleneck.  This can cause poor and highly volatile 
performance as each steals resources from the other.  
    +
    +It is highly recommended that each of these systems be fully isolated from 
the others.  Storm should be run on nodes that are completely isolated from 
Kafka and HBase.
    +
    +### Monitoring
    +
    +1. The `load_test.sh` script will report the throughput for the input and 
output topics.  
    +
    +   * The input throughput should roughly match the output throughput if 
the topology is able to handle a given load.
    +
    +   * Not only are the raw throughput numbers important, but also the 
consistency of what is reported over time.  If the reported throughput is 
sporadic, then further tuning may be required.
    +
    +1. The Storm UI is an important source of information.  The bolt 
capacity, complete latency, and any reported errors are all important to monitor.
    +
    +1. The load reported by the OS is also an important metric to monitor.  
    +
    +   * The load metric should be monitored to ensure that each node is 
well utilized, but not overloaded.
    +
    +   * The load should be evenly distributed across the nodes.  If the load 
is uneven, this may indicate a problem.
    +
    +   A simple script like the following is sufficient for the task.
    +
    +   ```
    +   for host in $(cat cluster.txt); do
    +     echo $host;
    +     ssh root@$host 'uptime';
    +   done
    +   ```
    +
    +1. Monitoring the Kafka offset lags indicates how far behind a consumer 
may be.  This can be very useful to determine if the topology is keeping up.
    +
    +   ```
    +   ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
    +       --command-config=/tmp/consumergroup.config \
    +       --describe \
    +       --group enrichments \
    +       --bootstrap-server $BROKERLIST \
    +       --new-consumer
    +   ```
    +
    +1. A tool like [Kafka Manager](https://github.com/yahoo/kafka-manager) is 
also very useful for monitoring the input and output topics during test 
execution.
    +
    +## Performance Tuning
    +
    +The approach to tuning the topology will look something like the 
following.  More detailed tuning information is available next to each named 
parameter.
    +
    +* Start the tuning process with a single worker.  After tuning the bolts 
within a single worker, scale out with additional worker processes.
    +
    +* Initially set the thread pool size to 1.  Increase this value slowly 
only after tuning the other parameters first.  Consider that each worker has 
its own thread pool and the total size of this thread pool should be far less 
than the total number of cores available in the cluster.
    +
    +* Initially set each bolt parallelism hint to the number of partitions on 
the input Kafka topic.  Monitor bolt capacity and increase the parallelism hint 
for any bolt whose capacity is close to or exceeds 1.  
    +
    +* If the topology is not able to keep up with a given input, then 
increasing the parallelism is the primary means to scale up.
    +
    +* Parallelism units can be used for determining how to distribute 
processing tasks across the topology.  The sum of parallelism can be close to, 
but should not far exceed this value.
    +
    +    (number of worker nodes in cluster * number of cores per worker node) - 
(number of acker tasks)
    +
    +* The throughput that the topology is able to sustain should be relatively 
consistent.  If the throughput fluctuates greatly, increase back pressure using 
[`topology.max.spout.pending`](#topology-max-spout-pending).
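As a worked example of the parallelism budget in the list above, the sketch below computes the value for a hypothetical cluster. The node, core, and acker counts are made-up inputs, and the function and component names are assumptions for illustration only.

```python
def parallelism_budget(worker_nodes, cores_per_node, acker_tasks):
    """(worker nodes * cores per worker node) - acker tasks."""
    return worker_nodes * cores_per_node - acker_tasks


# Hypothetical cluster: 4 worker nodes, 8 cores each, 4 acker tasks.
budget = parallelism_budget(4, 8, 4)
print(budget)  # 28

# Hypothetical parallelism hints; their sum should not far exceed the budget.
hints = {"spout": 8, "enrichment_bolt": 16, "output_bolt": 4}
print(sum(hints.values()) <= budget)
```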
    --- End diff --
    
    This should be `#topologymaxspoutpending`.
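
The suggested anchor follows from how GitHub derives anchors from headings. Below is a rough Python sketch of that rule, assuming the commonly described behavior (lowercase the heading, drop punctuation other than hyphens and underscores, turn spaces into hyphens) and assuming the target heading is `topology.max.spout.pending`. The function name is hypothetical, and the suffixes GitHub adds for duplicate headings are ignored.

```python
import re


def github_anchor(heading):
    """Approximate GitHub's heading-to-anchor rule (hypothetical helper)."""
    slug = heading.strip().lower()
    # Drop everything except word characters, spaces, and hyphens.
    slug = re.sub(r"[^\w\- ]", "", slug)
    # Spaces become hyphens.
    return "#" + slug.replace(" ", "-")


print(github_anchor("`topology.max.spout.pending`"))  # #topologymaxspoutpending
print(github_anchor("Benchmark Execution"))           # #benchmark-execution
```

The dots are stripped rather than converted to hyphens, which is why `#topology-max-spout-pending` would not resolve.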
