[ https://issues.apache.org/jira/browse/METRON-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin Leet updated METRON-1378:
--------------------------------
    Fix Version/s: 0.5.0

> Create a summarizer
> -------------------
>
>                 Key: METRON-1378
>                 URL: https://issues.apache.org/jira/browse/METRON-1378
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>            Priority: Major
>             Fix For: 0.5.0
>
>
> We have a nice and generalized infrastructure for loading data into HBase and
> interacting with it via `flatfile_loader.sh` and `ENRICHMENT_GET()`. It is
> also useful to summarize a set of data into a static data structure, store it
> on HDFS, and interact with it via Stellar. To this end, to complement
> `flatfile_loader.sh`, we should have a `flatfile_summarizer.sh` that, using
> the same extractor config, will process a flat file and output a serialized
> object.
> The use case for this is as follows:
> Let's say that I have a static list of domains in the second column of a CSV,
> domains.csv, and I want to generate a bloom filter containing those domains,
> sans TLD.
> I should be able to create a file called `bloom.ser` with the serialized
> bloom filter given the extractor config:
> {code}
> {
>   "config" : {
>     "columns" : {
>       "rank" : 0,
>       "domain" : 1
>     },
>     "value_transform" : {
>       "domain" : "DOMAIN_REMOVE_TLD(domain)"
>     },
>     "value_filter" : "LENGTH(domain) > 0",
>     "state_init" : "BLOOM_INIT()",
>     "state_update" : {
>       "state" : "BLOOM_ADD(state, domain)"
>     },
>     "state_merge" : "BLOOM_MERGE(states)",
>     "separator" : ","
>   },
>   "extractor" : "CSV"
> }
> {code}
> Note, the associated Stellar function `OBJECT_GET` is pending.
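To make the intended semantics of `state_init`, `state_update`, and `state_merge` in the extractor config above concrete, here is a minimal Python sketch of the init/update/merge pattern. It is an illustration under stated assumptions, not the summarizer's implementation: a plain `set` stands in for the bloom filter, `remove_tld` is a naive stand-in for `DOMAIN_REMOVE_TLD` that only drops the last dot-separated label, and the function names `summarize`, `state_init`, `state_update`, and `state_merge` are hypothetical Python names chosen to mirror the config keys.

```python
import csv
import io
from functools import reduce

# Mirrors "columns" in the extractor config: field name -> CSV column index.
COLUMNS = {"rank": 0, "domain": 1}

def remove_tld(domain):
    """Naive stand-in for DOMAIN_REMOVE_TLD: drops only the last dot-separated label."""
    domain = domain.strip().lower()
    return domain.rsplit(".", 1)[0] if "." in domain else domain

def state_init():
    """Stand-in for BLOOM_INIT(): an exact set rather than a probabilistic filter."""
    return set()

def state_update(state, record):
    """Stand-in for BLOOM_ADD(state, domain)."""
    state.add(record["domain"])
    return state

def state_merge(states):
    """Stand-in for BLOOM_MERGE(states): union of the partial states."""
    return reduce(set.union, states, set())

def summarize(text, separator=","):
    """Fold one chunk of CSV text into a single state."""
    state = state_init()
    for row in csv.reader(io.StringIO(text), delimiter=separator):
        if len(row) <= max(COLUMNS.values()):
            continue  # skip rows that lack the configured columns
        record = {name: row[idx] for name, idx in COLUMNS.items()}
        record["domain"] = remove_tld(record["domain"])  # "value_transform"
        if len(record["domain"]) > 0:                    # "value_filter"
            state = state_update(state, record)
    return state

# Two chunks summarized independently, then merged, as parallel workers might do.
part_a = summarize("1,google.com\n2,youtube.com\n")
part_b = summarize("3,facebook.com\n4,\n")
summary = state_merge([part_a, part_b])
print(sorted(summary))
```

The merge step is what lets the file be processed in independent chunks: each chunk produces its own state, and the partial states are combined at the end. The row with an empty domain is dropped by the filter before it ever reaches the state.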