I'll start the discussion with my idea for a second step that is more
adaptable. I propose the following set of Stellar functions, backed by
Spark, in the metron-management project:
- CSV_PARSE(location, separator?, columns?) : Constructs a Spark
  Dataframe for reading the flatfile
- SQL_TRANSFORM(dataframe, spark sql statement) : Transforms the dataframe
- SUMMARIZE(state_init, state_update, state_merge) : Summarize the
  dataframe using the lambda functions:
   - state_init - executed once per worker to initialize the state
   - state_update - executed once per row
   - state_merge - Merge the worker states into one worker state
- OBJECT_SAVE(obj, output_path) : Save the object obj to the path
  output_path on HDFS.
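To make the init/update/merge contract concrete, here is a minimal sketch in plain Python. It is not Stellar or Spark code; it only illustrates the shape of the three lambdas, using a bloom filter of domains as the state. The filter sizing, the "domain" column name, the summarize() driver, and the might_contain() helper are all assumptions made for illustration.

```python
import hashlib
from functools import reduce

NUM_BITS = 1 << 16   # filter size in bits (illustrative value)
NUM_HASHES = 4       # bit positions set per item (illustrative value)


def _positions(item):
    """Derive NUM_HASHES bit positions for item from salted SHA-256 digests."""
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def state_init():
    """Executed once per worker: an empty filter, stored as an int bitmask."""
    return 0


def state_update(state, row):
    """Executed once per row: set the bits for the row's domain."""
    for pos in _positions(row["domain"]):
        state |= 1 << pos
    return state


def state_merge(left, right):
    """Combine two worker states: the union of two filters is a bitwise OR."""
    return left | right


def summarize(partitions):
    """Hypothetical driver: init/update per partition, then merge the partials."""
    partials = []
    for rows in partitions:
        state = state_init()
        for row in rows:
            state = state_update(state, row)
        partials.append(state)
    return reduce(state_merge, partials, state_init())


def might_contain(state, domain):
    """True if domain may be in the set; False means it is definitely absent."""
    return all((state >> pos) & 1 for pos in _positions(domain))


# Two "workers", each holding part of the typosquatted-domain list.
partitions = [
    [{"domain": "gooogle.com"}, {"domain": "goggle.com"}],
    [{"domain": "amazom.com"}],
]
bloom = summarize(partitions)
```

Because state_merge is associative and commutative here, the result does not depend on how the rows are split across workers, which is the property a distributed SUMMARIZE would need from its merge lambda.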
This would enable more flexibility and composability than the
configuration-based approach that we have in the flatfile loader.
My concern with this approach, and the reason I didn't do it initially, is
that I think users will want at least two ways to summarize data (or
load data):
- A configuration-based approach, which enables a UI
- A set of Stellar functions via the scriptable REPL
I would argue that both have a place, and I started with the
configuration-based approach as it was a more natural extension of what we
already had.
I'd love to hear thoughts about this idea too.
On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <[email protected]> wrote:
> Hi all,
>
> I wanted to get some feedback on a sensible plan for something. It
> occurred to me the other day when considering the use-case of detecting
> typosquatted domains, that one approach was to generate the set of
> typosquatted domains for some set of reference domains and compare domains
> as they flow through.
>
> One way we could do this would be to generate this data and import the
> typosquatted domains into HBase. I thought, however, that another approach
> might be to trade off some accuracy to remove the network hop and potential
> disk seek by constructing a bloom filter that includes the set of
> typosquatted domains.
>
> The challenge was that we don't have a way to do this currently. We do,
> however, have a loading infrastructure (e.g. the flatfile_loader) and
> configuration (see
> https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> which handles:
>
> - parsing flat files
> - transforming the rows
> - filtering the rows
>
> To enable the new use-case of generating a summary object (e.g. a bloom
> filter), in METRON-1378 (https://github.com/apache/metron/pull/879) I
> propose that we create a new utility that uses the same extractor config
> and adds the ability to:
>
> - initialize a state object
> - update the object for every row
> - merge the state objects (needed only with multiple threads; with a
> single thread there is nothing to merge).
>
> I think this is a sensible decision because:
>
> - It's a minimal movement from the flat file loader
> - Uses the same configs
> - Abstracts and reuses the existing infrastructure
> - Having one extractor config means that it should be easier to
> generate a UI around this to simplify the experience
>
> All that being said, our extractor config is...shall we say...daunting :).
> I am sensitive to the fact that this adds to an already difficult config.
> I propose that this is an initial step forward to support the use-case and
> we can enable something more composable going forward. My concern in
> considering the composable approach as the first step was that composable
> units for data transformation and manipulation suddenly take us into a
> place where Stellar starts to look like Pig or the Spark RDD API. I wasn't
> ready for that without a lot more discussion.
>
> To summarize, what I'd like to get from the community is, after reviewing
> the entire use-case at
> https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
>
> - Is this so confusing that it does not belong in Metron even as a
> first-step?
> - Is there a way to extend the extractor config in a less confusing
> way to enable this?
>
> I apologize for making the discuss thread *after* the JIRAs, but I felt
> this one might bear having some working code to consider.
>