Github user justinleet commented on a diff in the pull request:
https://github.com/apache/metron/pull/879#discussion_r160179259
--- Diff: metron-platform/metron-data-management/README.md ---
@@ -354,3 +357,91 @@ The parameters for the utility are as follows:
| -r         | --remote_dir        | No           | HDFS directory to land formatted GeoIP file - defaults to /apps/metron/geo/\<epoch millis\>/ |
| -t         | --tmp_dir           | No           | Directory for landing the temporary GeoIP data - defaults to /tmp |
| -z         | --zk_quorum         | Yes          | Zookeeper Quorum URL (zk1:port,zk2:port,...) |
+
+### Flatfile Summarizer
+
+The shell script `$METRON_HOME/bin/flatfile_summarizer.sh` will read data from local disk, HDFS or URLs and generate a summary object.
+The object will be serialized and written to disk, either HDFS or local disk depending on the output mode specified.
+
+It should be noted that this utility uses the same extractor config as `flatfile_loader.sh`,
+but as the output target is not a key-value store (but rather a summary object), it is not necessary
+to specify certain configs:
+* `indicator`, `indicator_filter` and `indicator_transform` are not required, but will be executed if present.
+As in the loader, there will be an indicator field available if you specify it (by using `indicator` in the config).
+* `type` is neither required nor used
+
+Indeed, some new configs are expected (see the sketch after this list):
+* `state_init` : Executed once to initialize the state object (the object written out).
+* `state_update` : Called once per message. The fields available are the fields for the row as well as
+  * `indicator` - the indicator value if you've specified it in the config
+  * `state` - the current state. Useful for adding to the state (e.g. `BLOOM_ADD(state, val)` where `val` is the name of a field).
+* `state_merge` : If you are running this multi-threaded and your objects can be merged, this is the statement that will
+merge the state objects created per thread. There is a special field available to this config:
+  * `states` - a list of the state objects
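+
+As a minimal sketch, the three state configs could just as easily build a HyperLogLogPlus sketch instead of a bloom filter, assuming the `HLLP_*` Stellar functions are available in your install and using `domain` as a stand-in for whichever column your extractor config defines (fragment of the `config` section only):
+```
+"state_init" : "HLLP_INIT()",
+"state_update" : {
+  "state" : "HLLP_ADD(state, domain)"
+},
+"state_merge" : "HLLP_MERGE(states)"
+```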
+
+One special thing to note here is that there is a configuration
+parameter to the Extractor config that is only considered by this
+utility:
+* `inputFormat` : This specifies how to consider the data. The two implementations are `BY_LINE` and `WHOLE_FILE`.
+
+The default is `BY_LINE`, which makes sense for a list of CSVs where
+each line indicates a unit of information which can be imported.
+However, if you are importing a set of STIX documents, then you want
+each document to be considered as input to the Extractor.
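+
+For instance, a minimal sketch of an extractor config that treats each file as a single document (most of the config is elided here, and the `STIX` extractor is used for illustration) might be:
+```
+{
+  "config" : {
+    "inputFormat" : "WHOLE_FILE"
+  },
+  "extractor" : "STIX"
+}
+```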
+
+#### Example
+
+Consider the possibility that you want to generate a bloom filter with all of the domains in a CSV structured similarly to the Alexa top 1M domains, so the columns are:
+* rank
+* domain name
+
+You want to generate a bloom filter with just the domains, not considering the TLD.
+You would execute the following to:
+* read data from `./top-1m.csv`
+* write data to `./filter.ser`
+* use 5 threads
+
+```
+$METRON_HOME/bin/flatfile_summarizer.sh -i ./top-1m.csv -o ./filter.ser -e ./extractor.json -p 5 -b 128
+```
+
+To configure this, `extractor.json` would look like:
+```
+{
+  "config" : {
+    "columns" : {
+      "rank" : 0,
+      "domain" : 1
+    },
+    "value_transform" : {
+      "domain" : "DOMAIN_REMOVE_TLD(domain)"
+    },
+    "value_filter" : "LENGTH(domain) > 0",
+    "state_init" : "BLOOM_INIT()",
+    "state_update" : {
+      "state" : "BLOOM_ADD(state, domain)"
+    },
+    "state_merge" : "BLOOM_MERGE(states)",
+    "separator" : ","
+  },
+  "extractor" : "CSV"
+}
+```
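+
+Once the summary has been written, a sketch of how the filter might be consumed from Stellar, assuming the serialized filter has been placed in HDFS, that an `OBJECT_GET`-style function is available to deserialize it, and that the path below is purely illustrative:
+```
+BLOOM_EXISTS(OBJECT_GET('/apps/metron/reference/filter.ser'), DOMAIN_REMOVE_TLD(domain))
+```
+This returns true when the message's domain (with the TLD removed, matching the transform above) was present in the summarized CSV.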
+
+#### Parameters
+
+The parameters for the utility are as follows:
+
+| Short Code | Long Code           | Is Required? | Description |
+|------------|---------------------|--------------|-------------|
+| -h         |                     | No           | Generate the help screen/set of options |
+| -q         | --quiet             | No           | Do not update progress |
+| -e         | --extractor_config  | Yes          | JSON Document describing the extractor for this input data source |
+| -m         | --import_mode       | No           | The Import mode to use: LOCAL, MR. Default: LOCAL |
+| -om        | --output_mode       | No           | The Output mode to use: LOCAL, HDFS. Default: LOCAL |
+| -i         | --input             | Yes          | The input data location on local disk. If this is a file, then that file will be loaded. If this is a directory, then the files will be loaded recursively under that directory. |
+| -i         | --output            | Yes          | The output data location. |
--- End diff --
-o here, not -i
---