GitHub user cestella opened a pull request: https://github.com/apache/incubator-metron/pull/93
METRON-119 Move PCAP infrastructure from HBase

As it stands, the existing approach to handling PCAP data has some issues with high-volume packet capture data. With the advent of a DPDK plugin for capturing packet data, we are going to hit limitations on consumption throughput if we continue to push packet data into HBase at line speed. Furthermore, storing PCAP data in HBase limits the range of filter queries we can perform (i.e. only those expressible within the key). As of now, we require all fields to be present (source IP/port, destination IP/port, and protocol), rather than allowing any wildcards.

To address these issues, we should create a higher-performance topology which attaches the appropriate header to the raw packet and timestamp read from Kafka (as placed onto Kafka by the packet capture sensor) and appends the packet to a sequence file in HDFS. The sequence file will be rolled based on the number of packets or on time (e.g. one hour's worth of packets in a given sequence file). On the query side, we should adjust the middle-tier service layer to start a MR job over the appropriate set of sequence files to filter out the matching packets. NOTE: the UI modifications to make this reasonable for the end user will need to be done in a follow-on JIRA.

In order to test this PR, I would suggest the following as the "happy path":

1. Install the pycapa library & utility via the instructions [here](https://github.com/apache/incubator-metron/tree/master/metron-sensors/pycapa)
2. (If using single-node Vagrant) Kill the enrichment and sensor topologies via `for i in bro enrichment yaf snort;do storm kill $i;done`
3. Start the pcap topology via `/usr/metron/0.1BETA/bin/start_pcap_topology.sh`
4. Start the pycapa packet capture producer on eth1 via `/usr/bin/pycapa --producer --topic pcap -i eth1 -k node1:6667`
5. Watch the topology in the [Storm UI](http://node1:8744/index.html) and kill the packet capture utility from before once the number of packets ingested is over 1k.
6. Ensure that at least 2 files exist on HDFS by running `hadoop fs -ls /apps/metron/pcap`
7. Choose a file (denoted by $FILE) and dump a few of its contents using the `pcap_inspector` utility via `/usr/metron/0.1BETA/bin/pcap_inspector.sh -i $FILE -n 5`
8. Choose one of the lines and note the source IP/port and dest IP/port.
9. Go to the Kibana app at [http://node1:5000](http://node1:5000) on the single-node Vagrant (YMMV on EC2) and input that query in the Kibana PCAP panel.
10. Wait patiently while the MR job completes and the results are sent back in the form of a valid PCAP payload suitable for opening in Wireshark.
11. Open the result in Wireshark to ensure the payload is valid.

If the payload is not valid PCAP, please look at the [job history](http://node1:19888/jobhistory) and note the reason for the job failure, if any.

Also, please note the changes and additions to the documentation for the [pcap service](https://github.com/cestella/incubator-metron/tree/METRON-119/metron-streaming/metron-api) and the [pcap backend](https://github.com/cestella/incubator-metron/tree/METRON-119/metron-platform/metron-pcap-backend).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cestella/incubator-metron METRON-119

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-metron/pull/93.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #93

----

commit e5062606519bda57eb7c1a739317e4f2011cddd1
Author: cstella <ceste...@gmail.com>
Date: 2016-04-28T17:51:57Z

    METRON-119 Move the PCAP topology from HBase

commit 99bf1632a7e5ed3d36137ec326626c0b0f84d4bf
Author: cstella <ceste...@gmail.com>
Date: 2016-04-28T17:56:05Z

    Updating the documentation.
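For reviewers unfamiliar with the wire format: the "valid PCAP payload suitable for opening in Wireshark" returned by the query side is just a standard libpcap stream, i.e. a 24-byte global header followed by a 16-byte record header plus raw bytes per packet. This is also essentially what "attaching the appropriate header" to the (timestamp, raw packet) pairs from Kafka means. A minimal Python sketch of that assembly (function names are illustrative, not Metron's actual API):

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # standard libpcap magic number (microsecond timestamps)

def pcap_global_header(snaplen=65535, linktype=1):
    """24-byte libpcap global header; linktype 1 = Ethernet (LINKTYPE_ETHERNET)."""
    # magic, version major/minor, thiszone, sigfigs, snaplen, linktype
    return struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype)

def pcap_record(ts_micros, packet_bytes):
    """16-byte per-packet header (ts_sec, ts_usec, incl_len, orig_len) + data."""
    header = struct.pack("<IIII",
                         ts_micros // 1_000_000, ts_micros % 1_000_000,
                         len(packet_bytes), len(packet_bytes))
    return header + packet_bytes

def to_pcap(records):
    """records: iterable of (timestamp_in_microseconds, raw_packet_bytes)."""
    return pcap_global_header() + b"".join(pcap_record(ts, pkt)
                                           for ts, pkt in records)
```

Writing `to_pcap([...])` to a file yields something Wireshark can open, which is a quick sanity check for step 11 above.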
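The key win over the HBase-keyed scheme is that the MR job's filter can treat any of the five-tuple fields as a wildcard, rather than requiring all of them in the row key. A hedged sketch of that match logic (field names and the dict-based representation are assumptions for illustration, not the PR's actual classes):

```python
def matches(packet_fields, query):
    """Return True if packet_fields satisfies query.

    Both are dicts keyed by 'src_ip', 'src_port', 'dst_ip', 'dst_port',
    'protocol' (illustrative names). A query value of None, or an absent
    key, acts as a wildcard -- exactly what a fixed HBase row key cannot
    express.
    """
    return all(packet_fields.get(k) == v
               for k, v in query.items()
               if v is not None)

def filter_packets(packets, query):
    """What each map task conceptually does over its sequence-file split."""
    return [p for p in packets if matches(p, query)]
```

In the real topology this predicate would run inside the mapper over each sequence-file split, emitting only the matching (timestamp, packet) pairs for reassembly into the PCAP payload.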
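The "rolled based on number of packets or time" behavior is a simple either/or trigger. A minimal sketch of such a policy, under the stated assumptions (one hour or N packets, whichever comes first; class and method names are hypothetical, not from the PR):

```python
import time

class RollPolicy:
    """Roll the current sequence file when either max_packets have been
    written or max_seconds have elapsed since the file was opened."""

    def __init__(self, max_packets=1_000_000, max_seconds=3600, clock=time.time):
        self.max_packets = max_packets
        self.max_seconds = max_seconds
        self.clock = clock  # injectable for testing
        self.reset()

    def reset(self):
        """Call when a new sequence file is opened."""
        self.count = 0
        self.opened_at = self.clock()

    def record_written(self):
        self.count += 1

    def should_roll(self):
        return (self.count >= self.max_packets
                or self.clock() - self.opened_at >= self.max_seconds)
```

The writer would check `should_roll()` after each append and, when it fires, close the current HDFS sequence file, open a new one, and `reset()` the policy.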