mmiklavc commented on issue #1523: METRON-2232 Upgrade to Hadoop 3.1.1
URL: https://github.com/apache/metron/pull/1523#issuecomment-540208150
 
 
   ## Testing
   
   Adapted from a few places
   * https://gist.github.com/nickwallen/ed67fdc8b399f6db5fa4901b07fc3fff
    * https://cwiki.apache.org/confluence/display/METRON/2016/04/25/Metron+Tutorial+-+Fundamentals+Part+1%3A+Creating+a+New+Telemetry
   
   ### Preliminaries
   
   Test using the centos7 development environment.  
   
   * Start up the centos7 dev environment.
       ```
       cd metron-deployment/development/centos7
       vagrant destroy -f
       vagrant up
       # ssh into the box as root@node1, pwd=vagrant
       ```
   
    * Running as root is fine
   * Set env vars
   ```
   source /etc/default/metron
   ```
    * The root user needs a home directory in HDFS. Create one as follows:
   ```
   sudo -u hdfs hdfs dfs -mkdir /user/root
   sudo -u hdfs hdfs dfs -chown root:root /user/root
   ```
   * Download the Alexa top 1m data set
   ```
   cd ~/
   wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
   unzip top-1m.csv.zip
   ```
   
    * Stage the import file (verified in the sketch below)
   ```
   head -n 10000 top-1m.csv > top-10k.csv
   hdfs dfs -put top-10k.csv /tmp
   ```
   
    * Truncate the HBase `enrichment` table (also verified below)
   ```
   echo "truncate 'enrichment'" | hbase shell
   ```
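
    * Optionally, sanity-check both staging steps; a quick verification using the paths and table name from above:
    ```
    # the staged file should contain exactly 10000 lines
    hdfs dfs -cat /tmp/top-10k.csv | wc -l

    # the enrichment table should report 0 rows after truncation
    echo "count 'enrichment'" | hbase shell
    ```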
   
   ### Basic Indexing and Enrichment
   
    Ensure that we can continue to parse, enrich, and index telemetry.  Verify that data is flowing through the system, from parsing to indexing.
   
   1. Open Ambari and navigate to the Metron service 
http://node1:8080/#/main/services/METRON/summary
   
   1. Open the Alerts UI.  Verify alerts show up in the main UI - click the 
search icon (you may need to wait a moment for them to appear)
   
    1. Go to the Alerts UI and ensure that an ever-increasing stream of telemetry from Bro, Snort, and YAF is visible by watching the total alert count increase over time.
   
   1. Ensure that geoip enrichment is occurring.  The telemetry should contain 
fields like `enrichments:geo:ip_src_addr:location_point`.
   
   1. Head back to Ambari and select the Kibana service 
http://node1:8080/#/main/services/KIBANA/summary
   
   1. Open the Kibana dashboard via the "Metron UI" option in the quick links
   
   1. Verify the dashboard is populating
   
   ### Batch Indexing
   
   1. Use the Alerts UI to retrieve a rough count of the number of Bro messages 
that have been indexed.
   
   1. Retrieve the number of Bro messages that have been indexed in HDFS.
       ```
       [root@node1 0.7.2]# hdfs dfs -cat /apps/metron/indexing/indexed/bro/* | 
wc -l
       2785
       ```
   
   1. The number of messages indexed in HDFS should be close to the number 
indexed to the search indices.
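
    1. A sketch for comparing the two counts side by side; it assumes Elasticsearch is listening on node1:9200 and that the Bro indices use the default `bro_index_*` naming seen elsewhere in this report:
        ```
        # messages indexed to HDFS
        hdfs dfs -cat /apps/metron/indexing/indexed/bro/* | wc -l

        # messages indexed to Elasticsearch (host/port are assumptions)
        curl -s 'http://node1:9200/bro_index_*/_count?pretty'
        ```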
   
   ###  Streaming Enrichments
   
   Adapted from the [Metron Tutorial 
Series](https://cwiki.apache.org/confluence/display/METRON/2016/06/16/Metron+Tutorial+-+Fundamentals+Part+6%3A+Streaming+Enrichment).
   
     1. Launch the Stellar REPL.
         ```
         cd $METRON_HOME
         $METRON_HOME/bin/stellar -z $ZOOKEEPER
         ```
   
     1. Define the streaming enrichment and save it as a new source of 
telemetry.
   
         ```
         [Stellar]>>> conf := SHELL_EDIT(conf)
         {
           "parserClassName": "org.apache.metron.parsers.csv.CSVParser",
           "writerClassName": 
"org.apache.metron.writer.hbase.SimpleHbaseEnrichmentWriter",
           "sensorTopic": "user",
           "parserConfig": {
             "shew.table": "enrichment",
             "shew.cf": "t",
             "shew.keyColumns": "ip",
             "shew.enrichmentType": "user",
             "columns": {
               "user": 0,
               "ip": 1
             }
           }
         }
         [Stellar]>>>
         [Stellar]>>> CONFIG_PUT("PARSER", conf, "user")
         ```
   
     1. Go to the Management UI and start the new parser called 'user'.
   
     1. Create some test telemetry.
         ```
         [Stellar]>>> msgs := ["user1,192.168.1.1", "user2,192.168.1.2", 
"user3,192.168.1.3"]
         [user1,192.168.1.1, user2,192.168.1.2, user3,192.168.1.3]
         [Stellar]>>> KAFKA_PUT("user", msgs)
         3
         [Stellar]>>> KAFKA_PUT("user", msgs)
         3
         [Stellar]>>> KAFKA_PUT("user", msgs)
         3
         ```
   
     1. Ensure that the enrichments are persisted in HBase.
         ```
         [Stellar]>>> ENRICHMENT_GET('user', '192.168.1.1', 'enrichment', 't')
         {original_string=user1,192.168.1.1, 
guid=a6caf3c1-2506-4eb7-b33e-7c05b77cd72c, user=user1, timestamp=1551813589399, 
source.type=user}
   
         [Stellar]>>> ENRICHMENT_GET('user', '192.168.1.2', 'enrichment', 't')
         {original_string=user2,192.168.1.2, 
guid=49e4b8fa-c797-44f0-b041-cfb47983d54a, user=user2, timestamp=1551813589399, 
source.type=user}
   
         [Stellar]>>> ENRICHMENT_GET('user', '192.168.1.3', 'enrichment', 't')
         {original_string=user3,192.168.1.3, 
guid=324149fd-6c4c-42a3-b579-e218c032ea7f, user=user3, timestamp=1551813589402, 
source.type=user}
         ```
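
      1. As an extra check, scan the table directly from the HBase shell; a minimal sketch, assuming the `enrichment` table and column family `t` used above:
          ```
          echo "scan 'enrichment', {LIMIT => 3}" | hbase shell
          ```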
   
   ### Enrichment Coprocessor
   
     1. Confirm that the 'user' enrichment added in the previous section was 
'found' by the coprocessor.
           * Go to Swagger. 
           * Click the `sensor-enrichment-config-controller` option.
           * Click the `GET 
/api/v1/sensor/enrichment/config/list/available/enrichments` option.
   
     1. Click the "Try it out!" button. You should see an array returned with 
the value of each enrichment type that you have loaded.
       ```
       [
         "user"
       ]
       ```
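
      1. Alternatively, hit the same endpoint from the command line. A sketch, assuming the Metron REST server runs on node1:8082 with the full-dev credentials (user/password):
        ```
        curl -s -u user:password 'http://node1:8082/api/v1/sensor/enrichment/config/list/available/enrichments'
        ```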
   
   ### Enrichment Stellar Functions in Storm
   
  Adapted from the [Metron Tutorial Series](https://cwiki.apache.org/confluence/display/METRON/2016/04/28/Metron+Tutorial+-+Fundamentals+Part+2%3A+Creating+a+New+Enrichment) to load the user data.
   
     1. Create a simple file called `user.csv`.
       ```
        jdoe,192.168.138.2
       moredoe,192.168.138.158
       ```
       
     1. Create a file called `user-extractor.json`.
         ```
         {
           "config": {
             "columns": {
               "user": 0,
               "ip": 1
             },
             "indicator_column": "ip",
             "separator": ",",
             "type": "user"
           },
           "extractor": "CSV"
         }
         ```
   
     1. Import the data.
         ```
         source /etc/default/metron
         $METRON_HOME/bin/flatfile_loader.sh -i ./user.csv -t enrichment -c t 
-e ./user-extractor.json
         ```
   
     1. Validate that the enrichment loaded successfully.
         ```
         [root@node1 0.7.2]# source /etc/default/metron
         [root@node1 0.7.2]# $METRON_HOME/bin/stellar -z $ZOOKEEPER
         
         [Stellar]>>> ip_src_addr := "192.168.138.158"
         192.168.138.158
         
         [Stellar]>>> ENRICHMENT_GET('user', ip_src_addr, 'enrichment', 't')
         {ip=192.168.138.158, user=moredoe}
   
         [Stellar]>>> ip_dst_addr := "192.168.138.2"
         192.168.138.2
         
         [Stellar]>>> ENRICHMENT_GET('user', ip_dst_addr, 'enrichment', 't')
         {ip=192.168.138.2, user=jdoe}
         ```
   
     1. Use the User data to enrich the telemetry.  Run the following commands 
in the REPL.
         ```
         [Stellar]>>> bro := SHELL_EDIT()
         {
          "enrichment" : {
            "fieldMap": {
              "stellar" : {
                "config" : {
                  "users" : "ENRICHMENT_GET('user', ip_dst_addr, 'enrichment', 
't')",
                  "users2" : "ENRICHMENT_GET('user', ip_src_addr, 'enrichment', 
't')"
                }
              }
            }
          },
          "threatIntel": {
            "fieldMap": {},
            "fieldToTypeMap": {}
          }
         }
         [Stellar]>>> CONFIG_PUT("ENRICHMENT", bro, "bro")
         ```
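
      1. Optionally, confirm the configuration was pushed before moving on; a quick check in the REPL using `CONFIG_GET`:
          ```
          [Stellar]>>> CONFIG_GET("ENRICHMENT", "bro")
          ```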
   
     1. Wait for the new configuration to be picked up by the running topology.
   
     1. Review the Bro telemetry indexed into Elasticsearch.  Look for records 
where the `ip_dst_addr` is `192.168.138.2`. Ensure that some of the messages 
have the following fields created from the enrichment. (Wait a few minutes 
longer and you should also eventually start to see records with fields 
`"users2:user": "moredoe"`).
         * `users:user`
         * `users:ip`
         ```
         {
           "_index": "bro_index_2019.08.13.20",
           "_type": "bro_doc",
           "_id": "AWyMxSJFg1bv3MpSt284",
           ...
           "_source": {          
             "ip_dst_addr": "192.168.138.2",
             "ip_src_addr": "192.168.138.158",
             "timestamp": 1565729823979,
             "source:type": "bro",
             "guid": "6778beb4-569d-478f-b1c9-8faaf475ac2f"
             ...
             "users:user": "jdoe",
             "users:ip": "192.168.138.2",
             ...
           },
           ...
         }
         ```
   
   ### Loaders and Summarizers in MR mode
   
   #### Test the flatfile loader in MR mode
   
    * Create an extractor configuration for the CSV data by editing `extractor.json` and pasting in these contents:
   ```
    {
      "config" : {
        "columns" : {
          "domain" : 1,
          "rank" : 0
        },
        "indicator_column" : "domain",
        "type" : "alexa",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
   ```
   
   * Import from HDFS via MR
   ```
   # import data into hbase 
   $METRON_HOME/bin/flatfile_loader.sh -i /tmp/top-10k.csv -t enrichment -c t 
-e ./extractor.json -m MR
   # count data written and verify it's 10k
   echo "count 'enrichment'" | hbase shell
   ```
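
    * Optionally spot-check one of the imported rows from the Stellar REPL; a sketch, assuming `google.com` appears in your snapshot of the Alexa data:
    ```
    [Stellar]>>> ENRICHMENT_GET('alexa', 'google.com', 'enrichment', 't')
    ```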
   
   #### Test the flatfile summarizer in MR mode
   
    * Create an `extractor_count.json` file (referenced by the summarizer command below) and paste the following:
   ```
   {
     "config" : {
       "columns" : {
          "rank" : 0,
          "domain" : 1
       },
       "value_transform" : {
          "domain" : "DOMAIN_REMOVE_TLD(domain)"
       },
       "value_filter" : "LENGTH(domain) > 0",
       "state_init" : "0L",
       "state_update" : {
          "state" : "state + LENGTH( DOMAIN_TYPOSQUAT( domain ))"
        },
       "state_merge" : "REDUCE(states, (s, x) -> s + x, 0)",
       "separator" : ","
     },
     "extractor" : "CSV"
   }
   ```
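
    To see what the summarizer is accumulating: `state_init` seeds a per-partition count at zero, `state_update` adds the number of typosquatted variants generated for each domain, and `state_merge` sums the per-partition counts. You can gauge a single domain's contribution in the REPL; a sketch, assuming a TLD-stripped domain as input:
    ```
    [Stellar]>>> LENGTH( DOMAIN_TYPOSQUAT( 'google' ) )
    ```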
   
   * Create the summary from HDFS via MR
   ```
   $METRON_HOME/bin/flatfile_summarizer.sh -i /tmp/top-10k.csv -e 
~/extractor_count.json -p 5 -om CONSOLE -m MR
   ```
   * Verify you see a count in the output similar to the following:
   ```
   Processing /root/top-10k.csv
   19/10/03 21:19:56 WARN resolver.BaseFunctionResolver: Using System 
classloader
   Processed 9999 - \
   3478276
   ```
   
   ### Legacy HBase Adapter
   
    We are going to perform the same enrichment, but this time using the legacy HBase adapter.
   
     1. Use the User data to enrich the telemetry.  Run the following commands 
in the REPL.
         ```
         [Stellar]>>> yaf := SHELL_EDIT()
         {
           "enrichment" : {
             "fieldMap" : {
               "hbaseEnrichment" : [ "ip_dst_addr" ]
             },
             "fieldToTypeMap" : {
                "ip_dst_addr" : [ "user" ]
             },
             "config" : {
               "typeToColumnFamily" : {
                 "user" : "t"
               }
             }
           },
           "threatIntel" : { },
           "configuration" : { }
         }
         [Stellar]>>> CONFIG_PUT("ENRICHMENT", yaf, "yaf")
         ```
       
     1. Wait for the new configuration to be picked up by the running topology.
   
     1. Review the YAF telemetry indexed into Elasticsearch.  Look for records 
where the `ip_dst_addr` is `192.168.138.2`. Ensure that some of the messages 
have the following fields created from the enrichment.
         * `enrichments:hbaseEnrichment:ip_dst_addr:user:ip`
         * `enrichments:hbaseEnrichment:ip_dst_addr:user:user`
         ```
         {
           "_index": "yaf_index_2019.08.15.03",
           "_type": "yaf_doc",
           "_id": "AWyTZAwEIFY9jxc2THLF",
           "_version": 1,
           "_score": null,
           "_source": {
             "source:type": "yaf",
             "ip_dst_addr": "192.168.138.2",
             "ip_src_addr": "192.168.138.158",
             "guid": "6c73c09d-f099-4646-b653-762adce121fe",
             ...
             "enrichments:hbaseEnrichment:ip_dst_addr:user:ip": "192.168.138.2",
             "enrichments:hbaseEnrichment:ip_dst_addr:user:user": "jdoe",
           }
         }
          ```

   ### Profiler
   
   #### Profiler in the REPL
   
   1. Test a profile in the REPL according to [these 
instructions](https://github.com/apache/metron/tree/master/metron-analytics/metron-profiler-repl#getting-started).
   
       ```
       [Stellar]>>> values := PROFILER_FLUSH(profiler)
       [{period={duration=900000, period=1723089, start=1550780100000, 
end=1550781000000}, profile=hello-world, groups=[], value=4, 
entity=192.168.138.158}]
       ```
   
   #### Streaming Profiler
    
   1. Deploy that profile to the Streaming Profiler in Storm.
   
       ```
       [Stellar]>>> CONFIG_PUT("PROFILER", conf)
       ```
   
   1. Wait for the Streaming Profiler in Storm to flush and retrieve the 
measurement from HBase.  
   
       For the impatient, you can reset the period duration to 1 minute. 
Alternatively, you can allow the Profiler topology to work for a minute or two 
and then kill the `profiler` topology which will force it to flush a profile 
measurement to HBase.
   
       Retrieve the measurement from HBase.  Prior to this PR, it was not 
possible to query HBase from the REPL.
       ```
       [Stellar]>>> 
PROFILE_GET("hello-world","192.168.138.158",PROFILE_FIXED(30,"DAYS"))
       [2979]
       ```
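
        If you do shorten the period duration, keep in mind that the Profiler client must use matching period settings, or `PROFILE_GET` will compute the wrong row keys. A sketch of the relevant properties; the names are taken from the Metron profiler documentation, so verify them against your version:
        ```
        # profiler.properties (topology side)
        profiler.period.duration=1
        profiler.period.duration.units=MINUTES

        # global config (client side) must match
        "profiler.client.period.duration" : "1",
        "profiler.client.period.duration.units" : "MINUTES"
        ```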
   
   #### Batch Profiler
   
   1. Install Spark using Ambari.
   
       1. Stop Storm, YARN, Elasticsearch, Kibana, and Kafka.
   
       1. Install Spark2 using Ambari.
   
       1. Ensure that Spark can talk with HBase.
           ```
           cp /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/
           ```
   
    1. Use the Batch Profiler to back-fill your profile.  To do this, follow the directions [provided here](https://github.com/apache/metron/tree/master/metron-analytics/metron-profiler-spark#getting-started).
   
   1. Retrieve the entire profile, including the back-filled data.
   
       ```
       [Stellar]>>> 
PROFILE_GET("hello-world","192.168.138.158",PROFILE_FIXED(30,"DAYS"))
       [1203, 2849, 2900, 1944, 1054, 1241, 1721]
       ```
   
   ### PCAP
   
   Pulled from https://github.com/apache/metron/pull/1157#issuecomment-412972370
   
   Get PCAP data into Metron: 
    1. Install and set up pycapa (this has been updated in master recently): https://github.com/apache/metron/blob/master/metron-sensors/pycapa/README.md#centos-6
   2. (if using singlenode vagrant) Kill the enrichment, profiler, indexing, 
and sensor topologies via `for i in bro enrichment random_access_indexing 
batch_indexing yaf snort;do storm kill $i;done`
    3. Start the pcap topology via `$METRON_HOME/bin/start_pcap_topology.sh`
    4. Start the pycapa packet capture producer on eth1 (a sketch for confirming messages are flowing appears after this list)
   ```
   cd /opt/pycapa/pycapa-venv/bin/usr/bin
   pycapa --producer --kafka-topic pcap --interface eth1 --kafka-broker 
$BROKERLIST
   ```
    5. Watch the topology in the Storm UI and, once the number of packets ingested exceeds 3k, kill the packet capture utility started earlier.
   6. You can leave your virtualenv session now via `deactivate`
    7. Ensure that at least 3 files exist on HDFS by running `hdfs dfs -ls /apps/metron/pcap/input`
    8. Choose a file (denoted by `$FILE`) and dump a few of the contents using the pcap_inspector utility
   ```
   FILE=<file path in hdfs>
   $METRON_HOME/bin/pcap_inspector.sh -i $FILE -n 5
   ```
    9. Choose one of the lines in your output and note the protocol, e.g.
   ```
   TS: October 9, 2019 8:43:39 PM UTC,ip_src_addr: 192.168.66.1,ip_src_port: 
60911,ip_dst_addr: 192.168.66.121,ip_dst_port: 8080,protocol: 6
   TS: October 9, 2019 8:43:39 PM UTC,ip_src_addr: 192.168.66.121,ip_src_port: 
8080,ip_dst_addr: 192.168.66.1,ip_dst_port: 60911,protocol: 6
   TS: October 9, 2019 8:43:39 PM UTC,ip_src_addr: 192.168.66.121,ip_src_port: 
8080,ip_dst_addr: 192.168.66.1,ip_dst_port: 60911,protocol: 6
   TS: October 9, 2019 8:43:39 PM UTC,ip_src_addr: 192.168.66.121,ip_src_port: 
8080,ip_dst_addr: 192.168.66.1,ip_dst_port: 60911,protocol: 6
   TS: October 9, 2019 8:43:39 PM UTC,ip_src_addr: 192.168.66.1,ip_src_port: 
60911,ip_dst_addr: 192.168.66.121,ip_dst_port: 8080,protocol: 6
   ```
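
    To confirm messages were actually flowing during step 4, you can attach a console consumer to the `pcap` topic; a sketch, assuming the standard HDP Kafka client path (the payloads are raw binary, so expect unreadable output):
    ```
    /usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh \
      --bootstrap-server $BROKERLIST --topic pcap --max-messages 5
    ```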
   
    **Note:** when you run the fixed and query filter commands below, the resulting files will be placed in the directory from which you launched the job.
   
   #### Fixed filter
   
    1. Run a fixed filter query by executing the following command with the values noted above (match your start_time format to the date format provided - the default is milliseconds since the epoch)
   2. `cd ~/; $METRON_HOME/bin/pcap_query.sh fixed -st <start_time> -df 
"yyyyMMdd" -p <protocol_num> -rpf 500`
    3. Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. `pcap-data-20160617160549737+0000.pcap`
    4. Copy the files to your local machine and verify you can open them in Wireshark (or use the command-line sketch below). I chose a middle file and the last file. The middle file should have 500 records (per the records_per_file option), and the last one will likely have a number of records <= 500.
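
    If you don't have Wireshark handy, a command-line spot check also works; a sketch using tcpdump, which prints one line per packet read from the file:
    ```
    # expect 500 lines for a middle file (per the -rpf option)
    tcpdump -nn -r <pcap-data file> | wc -l
    ```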
   
   #### Query filter
   
    1. Run a query filter (Stellar) by executing a command similar to the following, with the values noted above (match your start_time format to the date format provided - the default is milliseconds since the epoch)
   2. `$METRON_HOME/bin/pcap_query.sh query -st "20160617" -df "yyyyMMdd" 
-query "protocol == '6'"  -rpf 500`
    3. Verify the MR job finishes successfully. Upon completion, you should see multiple files named with relatively current datestamps in your current directory, e.g. `pcap-data-20160617160549737+0000.pcap`
    4. Copy the files to your local machine and verify you can open them in Wireshark. I chose a middle file and the last file. The middle file should have 500 records (per the records_per_file option), and the last one will likely have a number of records <= 500.
   
   ### MaaS
   
   