GitHub user cestella opened a pull request:

    https://github.com/apache/incubator-metron/pull/93

    METRON-119 Move PCAP infrastructure from HBase

    As it stands, the existing approach to handling PCAP data struggles with 
high-volume packet capture data.  With the advent of a DPDK plugin for 
capturing packet data, we will hit limits on consumption throughput if we 
continue to push packet data into HBase at line speed.
    
    Furthermore, storing PCAP data in HBase limits the range of filter 
queries that we can perform (i.e. only those expressible within the key).  As 
of now, we require all fields to be present (source IP/port, destination 
IP/port and protocol), rather than allowing any wildcards.
    
    To address these issues, we should create a higher-performance topology 
which attaches the appropriate header to the raw packet and timestamp read from 
Kafka (as placed onto Kafka by the packet capture sensor) and appends this 
packet to a sequence file in HDFS.  The sequence file will be rolled based on 
number of packets or time (e.g. 1 hour's worth of packets in a given sequence 
file).
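    
    As a rough illustration of the header and rolling logic described above, 
here is a minimal Python sketch. It assumes the header in question is the 
classic libpcap file/record header and that the timestamp arrives in 
microseconds; neither is confirmed by this description. The names 
`global_header`, `packet_record` and `RollPolicy` are hypothetical and are not 
taken from the Metron code.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4      # classic libpcap magic (microsecond timestamps)
LINKTYPE_ETHERNET = 1        # assumption: packets are Ethernet frames


def global_header(snaplen=65535, linktype=LINKTYPE_ETHERNET):
    """24-byte libpcap file header, little-endian, format version 2.4."""
    return struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype)


def packet_record(ts_micros, raw):
    """Prefix a raw packet with the 16-byte libpcap per-packet header."""
    ts_sec, ts_usec = divmod(ts_micros, 1_000_000)
    return struct.pack("<IIII", ts_sec, ts_usec, len(raw), len(raw)) + raw


class RollPolicy:
    """Decide when to roll to a new file: by packet count or by elapsed time."""

    def __init__(self, max_packets, max_seconds):
        self.max_packets = max_packets
        self.max_seconds = max_seconds

    def should_roll(self, packets_written, opened_at, now):
        too_many = packets_written >= self.max_packets
        too_old = (now - opened_at) >= self.max_seconds
        return too_many or too_old
```

    For example, `RollPolicy(max_packets=100_000, max_seconds=3600)` would roll 
after one hour or 100,000 packets, whichever comes first; the actual thresholds 
are whatever the topology is configured with.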
    
    On the query side, we should adjust the middle tier service layer to start 
a MapReduce (MR) job on the appropriate set of sequence files to select the 
matching packets.  NOTE: the UI modifications to make this reasonable for the 
end user will need to be done in a follow-on JIRA.
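    
    To make the filtering idea concrete, here is a small Python sketch of a 
filter in which any unset field acts as a wildcard. It is an illustration 
only: the real filter runs inside the MR job, and the `FlowKey`, `FlowFilter` 
and `select_packets` names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class FlowKey:
    """The five fields a packet is identified by."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int


@dataclass(frozen=True)
class FlowFilter:
    """Same fields as FlowKey; a field left as None matches anything."""
    src_ip: Optional[str] = None
    src_port: Optional[int] = None
    dst_ip: Optional[str] = None
    dst_port: Optional[int] = None
    protocol: Optional[int] = None

    def matches(self, key: FlowKey) -> bool:
        for f in fields(self):
            wanted = getattr(self, f.name)
            if wanted is not None and wanted != getattr(key, f.name):
                return False
        return True


def select_packets(
    records: Iterable[Tuple[FlowKey, bytes]], flow_filter: FlowFilter
) -> Iterator[bytes]:
    """Yield the payload of every record whose key passes the filter."""
    for key, payload in records:
        if flow_filter.matches(key):
            yield payload
```

    With this shape, `FlowFilter(dst_port=80)` selects every packet sent to 
port 80 regardless of the other four fields, which is the kind of query a 
fully specified key cannot express.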
    
    In order to test this PR, I would suggest doing the following as the "happy 
path":
    
    1. Install the pycapa library & utility via instructions 
[here](https://github.com/apache/incubator-metron/tree/master/metron-sensors/pycapa)
    2. (if using singlenode vagrant) Kill the enrichment and sensor topologies 
via `for i in bro enrichment yaf snort;do storm kill $i;done`
    3. Start the pcap topology via 
`/usr/metron/0.1BETA/bin/start_pcap_topology.sh`
    4. Start the pycapa packet capture producer on eth1 via `/usr/bin/pycapa 
--producer --topic pcap -i eth1 -k node1:6667`
    5. Watch the topology in the [Storm UI](http://node1:8744/index.html) and 
kill the pycapa producer from step 4 once more than 1k packets have been 
ingested.
    6. Ensure that at least 2 files exist on HDFS by running `hadoop fs -ls 
/apps/metron/pcap`
    7. Choose a file (denoted by $FILE) and dump a few of its packets using 
the `pcap_inspector` utility via `/usr/metron/0.1BETA/bin/pcap_inspector.sh -i 
$FILE -n 5`
    8. Choose one of the lines and note the source ip/port and dest ip/port
    9. Go to the Kibana app at [http://node1:5000](http://node1:5000) on the 
singlenode vagrant (YMMV on EC2) and enter those values as the query in the 
Kibana PCAP panel.
    10. Wait patiently while the MR job completes and the results are sent back 
in the form of a valid PCAP payload suitable for opening in Wireshark.
    11. Open in Wireshark to ensure the payload is valid.
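    
    As a quick sanity check before opening the result in Wireshark, the 
payload can also be inspected with a few lines of Python. This is only a 
sketch: it assumes the result is a classic libpcap file (not pcapng), the 
function names are hypothetical, and it does not replace actually opening the 
file.

```python
import struct

# Classic libpcap magic numbers: microsecond and nanosecond variants.
_MAGICS = {0xA1B2C3D4, 0xA1B23C4D}


def pcap_byte_order(payload):
    """Return '<' or '>' if payload starts with a libpcap header, else None."""
    if len(payload) < 24:
        return None
    for order in ("<", ">"):
        (magic,) = struct.unpack(order + "I", payload[:4])
        if magic in _MAGICS:
            return order
    return None


def count_packets(payload):
    """Walk the record headers and count packets; raise if the file is cut off."""
    order = pcap_byte_order(payload)
    if order is None:
        raise ValueError("not a libpcap payload")
    offset, count = 24, 0
    while offset < len(payload):
        if offset + 16 > len(payload):
            raise ValueError("truncated record header")
        _, _, incl_len, _ = struct.unpack_from(order + "IIII", payload, offset)
        offset += 16 + incl_len
        if offset > len(payload):
            raise ValueError("truncated packet data")
        count += 1
    return count
```

    Reading the downloaded file with `open(path, "rb").read()` and passing the 
bytes to `count_packets` gives the number of packets returned by the query, 
or an error if the payload is not well-formed.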
    
    If the payload is not valid PCAP, then please look at the [job 
history](http://node1:19888/jobhistory) and note the reason for job failure if 
any.
    
    Also, please note the changes and additions to the documentation for the 
[pcap 
service](https://github.com/cestella/incubator-metron/tree/METRON-119/metron-streaming/metron-api)
 and [pcap 
backend](https://github.com/cestella/incubator-metron/tree/METRON-119/metron-platform/metron-pcap-backend).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron METRON-119

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #93
    
----
commit e5062606519bda57eb7c1a739317e4f2011cddd1
Author: cstella <ceste...@gmail.com>
Date:   2016-04-28T17:51:57Z

    METRON-119 Move the PCAP topology from HBase

commit 99bf1632a7e5ed3d36137ec326626c0b0f84d4bf
Author: cstella <ceste...@gmail.com>
Date:   2016-04-28T17:56:05Z

    Updating the documentation.

----


