GitHub user cestella reopened a pull request:
https://github.com/apache/metron/pull/879
METRON-1378: Create a summarizer
## Contributor Comments
We have a nice, generalized infrastructure for loading data into HBase
and interacting with it via `flatfile_loader.sh` and `ENRICHMENT_GET()`. It is
also useful to summarize a set of data into a static data structure, store it
on HDFS, and interact with it via Stellar. To this end, to complement
`flatfile_loader.sh`, we should have a `flatfile_summarizer.sh` that, using the
same extractor config, will process a flat file and output a serialized object.
The use case for this is as follows:
Let's say that I have a static list of domains in the second column of a
CSV, `domains.csv`, and I want to generate a bloom filter containing those
domains sans TLD.
I should be able to create a file called `bloom.ser` with the serialized
bloom filter given the extractor config:
```
{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "BLOOM_INIT()",
    "state_update" : {
      "state" : "BLOOM_ADD(state, domain)"
    },
    "state_merge" : "BLOOM_MERGE(states)",
    "separator" : ","
  },
  "extractor" : "CSV"
}
```
Note: the associated Stellar function `OBJECT_GET` is available in #880.
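To make the roles of `state_init`, `state_update`, and `state_merge` concrete, here is a minimal Python sketch of the flow that the extractor config describes. It is not the Metron implementation: `summarize` and `remove_tld` are hypothetical names, a plain `set` stands in for the bloom filter, `remove_tld` only approximates `DOMAIN_REMOVE_TLD` by dropping the last dot-separated label, and the round-robin split of rows across threads is an assumption.

```python
import csv


def remove_tld(domain):
    # Crude stand-in for DOMAIN_REMOVE_TLD: drops only the last label.
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def summarize(lines, num_threads=2):
    # "extractor": "CSV" with "separator": ","
    rows = [row for row in csv.reader(lines) if len(row) >= 2]
    # Assumed partitioning: each simulated thread gets every n-th row.
    partitions = [rows[i::num_threads] for i in range(num_threads)]
    states = []
    for part in partitions:
        state = set()                      # state_init (set replaces BLOOM_INIT())
        for row in part:
            domain = remove_tld(row[1])    # columns + value_transform
            if len(domain) > 0:            # value_filter
                state.add(domain)          # state_update
        states.append(state)
    return set().union(*states)            # state_merge over the per-thread states


print(sorted(summarize(["1,google.com", "2,youtube.com", "3,google.com"])))
# ['google', 'youtube']
```

The point of the three-stage shape is that each thread only touches its own state, so the only coordination needed is the final merge.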
# Testing Plan
We should run the test plan for #445 to ensure no regressions, since 80% of
this PR is just refactoring existing abstractions for reuse.
## Write out a String Locally
We are going to take the top 10k Alexa domains (saved as part of #445's
test plan to `~/top-10k.csv`) and:
* Keep a running sample of 20 domains per thread
* At the end, merge the samples and get a random domain from the merged
samples
* Write out the sampled domain
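The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not how `SAMPLE_INIT`, `SAMPLE_ADD`, or `SAMPLE_MERGE` are implemented: `Reservoir`, `merge`, and `pick_random_domain` are hypothetical names, the per-thread sample uses classic reservoir sampling (Algorithm R), and the merge simply re-samples the retained items, which is only approximately uniform when threads see different numbers of rows.

```python
import random


class Reservoir:
    """Fixed-size uniform sample of a stream (Algorithm R)."""

    def __init__(self, size, rng):
        self.size = size
        self.rng = rng
        self.items = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.size:
            self.items.append(item)
        else:
            # Keep the new item with probability size / seen.
            j = self.rng.randrange(self.seen)
            if j < self.size:
                self.items[j] = item


def merge(reservoirs, size, rng):
    # Simplified merge: feed every retained item into a fresh reservoir.
    merged = Reservoir(size, rng)
    for reservoir in reservoirs:
        for item in reservoir.items:
            merged.add(item)
    return merged


def pick_random_domain(domains, num_threads=5, sample_size=20, rng=None):
    rng = rng or random.Random()
    per_thread = []
    for t in range(num_threads):
        reservoir = Reservoir(sample_size, rng)
        for domain in domains[t::num_threads]:   # assumed round-robin split
            reservoir.add(domain)
        per_thread.append(reservoir)
    # Merging into a reservoir of size 1 yields a single random domain.
    return merge(per_thread, 1, rng).items[0]
```

Calling `pick_random_domain(list_of_domains)` returns one element of the input list. Merging into a size-1 reservoir plays the role that `SAMPLE_INIT(1)` plays in the `state_merge` expression of the test config.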
### Test
* Create a file `~/extractor_sample.json` with the following contents:
```
{
  "config" : {
    "columns" : {
      "rank" : 0,
      "domain" : 1
    },
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
    "value_filter" : "LENGTH(domain) > 0",
    "state_init" : "SAMPLE_INIT(20)",
    "state_update" : {
      "state" : "SAMPLE_ADD(state, domain)"
    },
    "state_merge" : "GET_FIRST(SAMPLE_GET(SAMPLE_MERGE(states, SAMPLE_INIT(1))))",
    "separator" : ","
  },
  "extractor" : "CSV"
}
```
* Summarize via `$METRON_HOME/bin/flatfile_summarizer.sh -i ~/top-10k.csv
-o ~/sample.ser -e ~/extractor_sample.json -p 5 -b 128`
* Execute `hexdump -C ~/sample.ser` and ensure that there is a string in
it. The string may be preceded and followed by some non-ASCII bytes,
e.g.
```
[root@node1 ~]# hexdump -C ./sample.ser
00000000 03 01 37 63 66 6d 6e e6 |..7cfmn.|
00000008
[root@node1 ~]# cat top-10k.csv | grep 7cfmn
4696,7cfmnf.top
```
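If `hexdump` is not handy, the same check can be sketched in Python by pulling runs of printable ASCII out of the file, much like `strings(1)` does. `printable_runs` is a hypothetical helper, and the bytes below are copied from the hexdump output shown in this test plan rather than read from a real `sample.ser`.

```python
def printable_runs(data, min_len=4):
    """Return runs of printable ASCII at least min_len bytes long."""
    runs, current = [], bytearray()
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            current.append(byte)
        else:
            if len(current) >= min_len:
                runs.append(current.decode("ascii"))
            current = bytearray()
    if len(current) >= min_len:
        runs.append(current.decode("ascii"))
    return runs


# Bytes from the hexdump example; for a real file use
# open(path, "rb").read() instead.
sample = bytes.fromhex("03 01 37 63 66 6d 6e e6")
print(printable_runs(sample))  # ['7cfmn']
```

Note that the trailing byte `0xe6` with its high bit cleared is `0x66` (`f`), which would complete `7cfmnf`. Whether the serializer marks the final character this way is an assumption here, not something this PR states.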
### Typosquatting Use-case Testing
You can also follow the testing plan for #882 as this code is merged into
that PR and it shows how this feature can be used in a real use-case.
## Pull Request Checklist
Thank you for submitting a contribution to Apache Metron.
Please refer to our [Development
Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235)
for the complete guide to follow for contributions.
Please refer also to our [Build Verification
Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview)
for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow
these guidelines and ask you to double check the following:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to
be created at [Metron
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
- [x] Has your PR been rebased against the latest commit within the target
branch (typically master)?
### For code changes:
- [x] Have you included steps to reproduce the behavior or problem that is
being changed or addressed?
- [x] Have you included steps or a guide to how the change may be verified
and tested manually?
- [x] Have you ensured that the full suite of tests and checks has been
executed in the root metron folder via:
```
mvn -q clean integration-test install && build_utils/verify_licenses.sh
```
- [x] Have you written or updated unit tests and/or integration tests to
verify your changes?
- [x] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] Have you verified the basic functionality of the build by building
and running locally with Vagrant full-dev environment or the equivalent?
### For documentation related changes:
- [x] Have you ensured that the format looks appropriate for the output in
which it is rendered by building and verifying the site-book? If not, run
the following commands and then verify the changes via
`site-book/target/site/index.html`:
```
cd site-book
mvn site
```
#### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up
for your personal repository such that your branches are built there before
submitting a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cestella/incubator-metron flatfile_object_gen
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metron/pull/879.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #879
----
commit 9c492c4540534fa72550aff330ce6c588f640965
Author: cstella <cestella@...>
Date: 2017-12-21T15:17:18Z
flatfile summarizer initial commit.
commit 15681143e86913a692777770d0a89e1c877e3d99
Author: cstella <cestella@...>
Date: 2017-12-21T18:50:58Z
typo
commit 935d4d2933e7156219722e54cec5dfce228fdbcc
Author: cstella <cestella@...>
Date: 2017-12-21T21:17:23Z
Updating tests and docs.
commit afe91c341608468e2637db4a02f9428ebe19353a
Author: cstella <cestella@...>
Date: 2017-12-21T21:18:20Z
more docs.
commit d955e26cf4e7776642e83b23deb305fd5a238cc2
Author: cstella <cestella@...>
Date: 2017-12-21T21:46:30Z
Renamed test.
commit ac3c612cd6fd7140a14fac9692000f04b65ecc83
Author: cstella <cestella@...>
Date: 2017-12-22T12:23:04Z
Adding a ToString writer.
commit 34cdb55f6c43049151c5b5242a73a09119de31ef
Author: cstella <cestella@...>
Date: 2017-12-22T15:10:15Z
Renamed to console writer
commit b3e4408ab98d69866774bae452e9cc47efc4fbdd
Author: cstella <cestella@...>
Date: 2017-12-22T15:14:43Z
newline issue.
commit 767e4976a723451c92ff7bbceffafd5c38086c19
Author: cstella <cestella@...>
Date: 2017-12-23T15:32:07Z
Allowing empty outputs
commit b4e40a4e47ddc6ff871ef0e95b433fb4315f8e34
Author: cstella <cestella@...>
Date: 2017-12-23T16:07:10Z
Missed a compilation error.
commit 3ed05682372b10aa544f7fbba8a93d7dca78ca25
Author: cstella <cestella@...>
Date: 2018-01-08T14:32:34Z
Merge branch 'master' into flatfile_object_gen
----