Re: [DISCUSS] Generating and Interacting with serialized summary objects

Nick Allen Wed, 03 Jan 2018 07:48:37 -0800

> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
> this will impact performance


What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
seems really high, unless I am not understanding something.






On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <ceste...@gmail.com> wrote:

> Thanks for the feedback, Nick.
>
> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>
> I would argue that we are not reinventing the wheel for text manipulation
> as the extractor config exists already and we are doing a similar thing in
> the flatfile loader (in fact, the code is reused and merely extended).
> Transformation operations are already supported in our codebase in the
> extractor config; this PR has just added some hooks for stateful
> operations.
>
> Furthermore, we will need a configuration object to pass to the REST call
> if we are ever to create a UI around importing data into hbase or creating
> these summary objects.
>
> Regarding your example:
> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>
> I'm very sympathetic to this type of extension, but it has some issues:
>
>    1. This implies a single-threaded addition to the bloom filter.
>       1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
>       think this will impact performance
>       2. There's not a way to specify how to merge across threads if we do
>       make a multithread command line option
>    2. This restricts these kinds of operations to people with heavy Unix
>    CLI knowledge, who often aren't the people who would be doing this type
>    of operation
>    3. What if we need two variables passed to stellar?
>    4. This approach will be harder to move to Hadoop.  Eventually we will
>    want to support data on HDFS being processed by Hadoop (similar to the
>    flatfile loader), so instead of -m LOCAL being passed for the flatfile
>    summarizer you'd pass -m SPARK and the processing would happen on the
>    cluster
>       1. This is particularly relevant in this case as it's an
>       embarrassingly parallel problem in general
>
> In summary, while this CLI approach is attractive, I prefer the extractor
> config solution because it is the solution with the smallest iteration
> that:
>
>    1. Reuses existing metron extraction infrastructure
>    2. Provides the most solid base for the extensions that will be sorely
>    needed soon (and will keep it in parity with the flatfile loader)
>    3. Provides the most solid base for a future UI extension in the
>    management UI to support both summarization and loading
>
>
>
>
> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <n...@nickallen.org> wrote:
>
> > First off, I really do like the typosquatting use case and a lot of what
> > you have described.
> >
> > > We need a way to generate the summary sketches from flat data for
> > > this to work.
> > > ..
> > >
> >
> > I took this quote directly from your use case.  Above is the point that
> > I'd like to discuss and what your proposed solutions center on.  This is
> > what I think you are trying to do, at least with PR #879
> > <https://github.com/apache/metron/pull/879>...
> >
> > (Q) Can we repurpose Stellar functions so that they can operate on text
> > stored in a file system?
> >
> >
> > Whether we use the (1) Configuration or the (2) Function-based approach
> > that you described, fundamentally we are introducing new ways to perform
> > text manipulation inside of Stellar.
> >
> > IMHO, I'd rather not reinvent the wheel for text manipulation.  It would
> > be painful to implement and maintain a bunch of Stellar functions for text
> > manipulation.  People already have a large number of tools available to
> > do this and everyone has their favorites.  People are resistant to learning
> > something new when they already are familiar with another way to do the
> > same thing.
> >
> > So then the question is, how else can we do this?  My suggestion is that
> > rather than introducing text manipulation tools inside of Stellar, we
> > allow people to use the text manipulation tools they already know, but with
> > the Stellar functions that we already have.  And the obvious way to tie
> > those two things together is the Unix pipeline.
> >
> > A quick, albeit horribly incomplete, example to flesh this out a bit more
> > based on the example you have in PR #879
> > <https://github.com/apache/metron/pull/879>.  This would allow me to
> > integrate Stellar with whatever external tools that I want.
> >
> > $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
> > 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
> >
> > > I'll start this discussion off with my idea around a 2nd step that is
> > > more adaptable.  I propose the following set of stellar functions backed
> > > by Spark in the metron-management project:
> > >
> > >    - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
> > >    Dataframe for reading the flatfile
> > >    - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> > >    dataframe
> > >    - SUMMARIZE(state_init, state_update, state_merge): Summarize the
> > >    dataframe using the lambda functions:
> > >       - state_init - executed once per worker to initialize the state
> > >       - state_update - executed once per row
> > >       - state_merge - Merge the worker states into one worker state
> > >    - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
> > >    output_path on HDFS.
> > >
> > > This would enable more flexibility and composability than the
> > > configuration-based approach that we have in the flatfile loader.
> > > My concern with this approach, and the reason I didn't do it initially,
> > > was that I think that users will want at least 2 ways to summarize data
> > > (or load data):
> > >
> > >    - A configuration based approach, which enables a UI
> > >    - A set of stellar functions via the scriptable REPL
> > >
> > > I would argue that both have a place and I started with the
> > > configuration based approach as it was a more natural extension of what
> > > we already had.  I'd love to hear thoughts about this idea too.
> > >
> > >
> > > On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I wanted to get some feedback on a sensible plan for something.  It
> > > > occurred to me the other day when considering the use-case of
> > > > detecting typosquatted domains, that one approach was to generate the
> > > > set of typosquatted domains for some set of reference domains and
> > > > compare domains as they flow through.
> > > >
> > > > One way we could do this would be to generate this data and import
> > > > the typosquatted domains into HBase.  I thought, however, that another
> > > > approach, which may trade off accuracy to remove the network hop and
> > > > potential disk seek, would be to construct a bloom filter that includes
> > > > the set of typosquatted domains.
> > > >
> > > > The challenge was that we don't have a way to do this currently.  We
> > > > do, however, have a loading infrastructure (e.g. the flatfile_loader)
> > > > and configuration (see
> > > > https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> > > > which handles:
> > > >
> > > >    - parsing flat files
> > > >    - transforming the rows
> > > >    - filtering the rows
> > > >
> > > > To enable the new use-case of generating a summary object (e.g. a
> > > > bloom filter), in METRON-1378 (https://github.com/apache/metron/pull/879)
> > > > I propose that we create a new utility that uses the same extractor
> > > > config and adds the ability to:
> > > >
> > > >    - initialize a state object
> > > >    - update the object for every row
> > > >    - merge the state objects (needed only with multiple threads; with
> > > >    one thread there is nothing to merge).
> > > >
> > > > I think this is a sensible decision because:
> > > >
> > > >    - It's a minimal movement from the flat file loader
> > > >       - Uses the same configs
> > > >       - Abstracts and reuses the existing infrastructure
> > > >    - Having one extractor config means that it should be easier to
> > > >    generate a UI around this to simplify the experience
> > > >
> > > > All that being said, our extractor config is..shall we say...daunting
> > > > :).  I am sensitive to the fact that this adds to an already difficult
> > > > config.  I propose that this is an initial step forward to support the
> > > > use-case and we can enable something more composable going forward.  My
> > > > concern in considering this as the first step was that it felt that the
> > > > composable units for data transformation and manipulation suddenly take
> > > > us into a place where Stellar starts to look like Pig or Spark RDD API.
> > > > I wasn't ready for that without a lot more discussion.
> > > >
> > > > To summarize, what I'd like to get from the community is, after
> > > > reviewing the entire use-case at
> > > > https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> > > >
> > > >    - Is this so confusing that it does not belong in Metron even as a
> > > >    first-step?
> > > >    - Is there a way to extend the extractor config in a less
> > > >    confusing way to enable this?
> > > >
> > > > I apologize for making the discuss thread *after* the JIRAs, but I
> > > > felt this one might bear having some working code to consider.
> > > >
> > >
> >
>
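
For reference, the initialize / update / merge steps described in the quoted
proposal can be sketched in a few lines of Python.  This is only an
illustration under assumptions, not the code in PR #879: the filter size, the
hashing scheme, and the helper names summarize and might_contain are made up
for the example, while state_init, state_update and state_merge borrow their
names from the SUMMARIZE description quoted above.  The threads are there only
to show where the merge step fits, not to make any claim about speed.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

NUM_BITS = 1 << 16   # illustrative filter size, not tuned
NUM_HASHES = 4       # illustrative number of hash positions, not tuned


def _positions(item):
    """Derive NUM_HASHES bit positions from one SHA-256 digest (double hashing)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def state_init():
    """Create an empty filter; the bit set is held in a single Python int."""
    return 0


def state_update(state, item):
    """Set the bits for one row and return the updated state."""
    for pos in _positions(item):
        state |= 1 << pos
    return state


def state_merge(states):
    """Union the per-worker filters with bitwise OR."""
    return reduce(lambda a, b: a | b, states, state_init())


def might_contain(state, item):
    """True if the item may be present; False means it was never added."""
    return all((state >> pos) & 1 for pos in _positions(item))


def summarize(rows, workers=5):
    """Split rows across workers, build one filter per worker, then merge."""
    chunks = [rows[i::workers] for i in range(workers)]

    def build(chunk):
        state = state_init()
        for row in chunk:
            state = state_update(state, row)
        return state

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return state_merge(list(pool.map(build, chunks)))
```

Calling summarize(["example", "apache"]) returns a merged filter, and
might_contain(merged, "example") then reports the domain as possibly present.
Because the merge is a plain bitwise OR, the result does not depend on how the
rows were split across workers, which is what makes a merge hook sufficient
for the multi-threaded (or, later, cluster) case.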
