> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think this will impact performance
What exactly takes an hour? Adding 1M entries to a bloom filter? That seems
really high, unless I am not understanding something.

On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella <ceste...@gmail.com> wrote:

> Thanks for the feedback, Nick.
>
> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>
> I would argue that we are not reinventing the wheel for text manipulation,
> as the extractor config exists already and we are doing a similar thing in
> the flatfile loader (in fact, the code is reused and merely extended).
> Transformation operations are already supported in our codebase in the
> extractor config; this PR has just added some hooks for stateful
> operations.
>
> Furthermore, we will need a configuration object to pass to the REST call
> if we are ever to create a UI around importing data into HBase or creating
> these summary objects.
>
> Regarding your example:
>
>   $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>
> I'm very sympathetic to this type of extension, but it has some issues:
>
>    1. This implies a single-threaded addition to the bloom filter.
>       1. Even with 5 threads, it takes an hour for the full Alexa 1m, so I
>          think this will impact performance
>       2. There's not a way to specify how to merge across threads if we do
>          make a multithreaded command line option
>    2. This restricts these kinds of operations to roles with heavy Unix CLI
>       knowledge, which often isn't the kind of person who would be doing
>       this type of operation
>    3. What if we need two variables passed to Stellar?
>    4. This approach will be harder to move to Hadoop. Eventually we will
>       want to support data on HDFS being processed by Hadoop (similar to
>       the flatfile loader), so instead of -m LOCAL being passed for the
>       flatfile summarizer you'd pass -m SPARK and the processing would
>       happen on the cluster
>       1. This is particularly relevant in this case as it's an
>          embarrassingly parallel problem in general
>
> In summary, while this CLI approach is attractive, I prefer the extractor
> config solution because it is the solution with the smallest iteration
> that:
>
>    1. Reuses existing Metron extraction infrastructure
>    2. Provides the most solid base for the extensions that will be sorely
>       needed soon (and will keep it in parity with the flatfile loader)
>    3. Provides the most solid base for a future UI extension in the
>       management UI to support both summarization and loading
>
> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen <n...@nickallen.org> wrote:
>
> > First off, I really do like the typosquatting use case and a lot of what
> > you have described.
> >
> > > We need a way to generate the summary sketches from flat data for this
> > > to work.
> > > ..
> >
> > I took this quote directly from your use case. Above is the point that
> > I'd like to discuss and what your proposed solutions center on. This is
> > what I think you are trying to do, at least with PR #879
> > <https://github.com/apache/metron/pull/879>...
> >
> > (Q) Can we repurpose Stellar functions so that they can operate on text
> > stored in a file system?
> >
> > Whether we use the (1) Configuration or the (2) Function-based approach
> > that you described, fundamentally we are introducing new ways to perform
> > text manipulation inside of Stellar.
> >
> > IMHO, I'd rather not reinvent the wheel for text manipulation. It would
> > be painful to implement and maintain a bunch of Stellar functions for
> > text manipulation. People already have a large number of tools available
> > to do this and everyone has their favorites. People are resistant to
> > learning something new when they already are familiar with another way
> > to do the same thing.
> >
> > So then the question is, how else can we do this? My suggestion is that
> > rather than introducing text manipulation tools inside of Stellar, we
> > allow people to use the text manipulation tools they already know, but
> > with the Stellar functions that we already have. And the obvious way to
> > tie those two things together is the Unix pipeline.
> >
> > A quick, albeit horribly incomplete, example to flesh this out a bit
> > more, based on the example you have in PR #879
> > <https://github.com/apache/metron/pull/879>. This would allow me to
> > integrate Stellar with whatever external tools I want.
> >
> >   $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
> >
> > On Sun, Dec 24, 2017 at 8:28 PM, Casey Stella <ceste...@gmail.com> wrote:
> >
> > > I'll start this discussion off with my idea around a 2nd step that is
> > > more adaptable. I propose the following set of Stellar functions backed
> > > by Spark in the metron-management project:
> > >
> > >    - CSV_PARSE(location, separator?, columns?) : Constructs a Spark
> > >      Dataframe for reading the flatfile
> > >    - SQL_TRANSFORM(dataframe, spark sql statement): Transforms the
> > >      dataframe
> > >    - SUMMARIZE(state_init, state_update, state_merge): Summarize the
> > >      dataframe using the lambda functions:
> > >       - state_init - executed once per worker to initialize the state
> > >       - state_update - executed once per row
> > >       - state_merge - Merge the worker states into one worker state
> > >    - OBJECT_SAVE(obj, output_path) : Save the object obj to the path
> > >      output_path on HDFS.
> > >
> > > This would enable more flexibility and composability than the
> > > configuration-based approach that we have in the flatfile loader.
> > >
> > > My concern with this approach, and the reason I didn't do it
> > > initially, was that I think that users will want at least 2 ways to
> > > summarize data (or load data):
> > >
> > >    - A configuration-based approach, which enables a UI
> > >    - A set of Stellar functions via the scriptable REPL
> > >
> > > I would argue that both have a place, and I started with the
> > > configuration-based approach as it was a more natural extension of what
> > > we already had. I'd love to hear thoughts about this idea too.
> > >
> > > On Sun, Dec 24, 2017 at 8:20 PM, Casey Stella <ceste...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I wanted to get some feedback on a sensible plan for something. It
> > > > occurred to me the other day, when considering the use-case of
> > > > detecting typosquatted domains, that one approach was to generate the
> > > > set of typosquatted domains for some set of reference domains and
> > > > compare domains as they flow through.
> > > >
> > > > One way we could do this would be to generate this data and import
> > > > the typosquatted domains into HBase. I thought, however, that another
> > > > approach might be to trade off accuracy to remove the network hop and
> > > > potential disk seek by constructing a bloom filter that includes the
> > > > set of typosquatted domains.
> > > >
> > > > The challenge was that we don't have a way to do this currently. We
> > > > do, however, have a loading infrastructure (e.g. the flatfile_loader)
> > > > and configuration (see
> > > > https://github.com/apache/metron/tree/master/metron-platform/metron-data-management#common-extractor-properties)
> > > > which handles:
> > > >
> > > >    - parsing flat files
> > > >    - transforming the rows
> > > >    - filtering the rows
> > > >
> > > > To enable the new use-case of generating a summary object (e.g. a
> > > > bloom filter), in METRON-1378
> > > > (https://github.com/apache/metron/pull/879) I propose that we create
> > > > a new utility that uses the same extractor config and adds the
> > > > ability to:
> > > >
> > > >    - initialize a state object
> > > >    - update the object for every row
> > > >    - merge the state objects (in the case of multiple threads; with
> > > >      one thread it's not needed).
> > > >
> > > > I think this is a sensible decision because:
> > > >
> > > >    - It's a minimal movement from the flat file loader
> > > >    - Uses the same configs
> > > >    - Abstracts and reuses the existing infrastructure
> > > >    - Having one extractor config means that it should be easier to
> > > >      generate a UI around this to simplify the experience
> > > >
> > > > All that being said, our extractor config is..shall we
> > > > say...daunting :). I am sensitive to the fact that this adds to an
> > > > existing difficult config. I propose that this is an initial step
> > > > forward to support the use-case and we can enable something more
> > > > composable going forward. My concern in considering this as the first
> > > > step was that it felt that the composable units for data
> > > > transformation and manipulation suddenly take us into a place where
> > > > Stellar starts to look like Pig or the Spark RDD API. I wasn't ready
> > > > for that without a lot more discussion.
> > > >
> > > > To summarize, what I'd like to get from the community is, after
> > > > reviewing the entire use-case at
> > > > https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection:
> > > >
> > > >    - Is this so confusing that it does not belong in Metron even as a
> > > >      first-step?
> > > >    - Is there a way to extend the extractor config in a less
> > > >      confusing way to enable this?
> > > >
> > > > I apologize for making the discuss thread *after* the JIRAs, but I
> > > > felt this one might bear having some working code to consider.
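
For anyone following the thread, the init / update / merge pattern discussed above (state_init, state_update, state_merge) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Metron's implementation: the BloomFilter class, its sizing and hashing scheme, and the summarize helper are all hypothetical names made up for this sketch. It just shows why a merge step is needed once more than one worker is adding to a filter.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class BloomFilter:
    """Toy bloom filter (hypothetical; sizes and hashing are assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def state_init():
    # Executed once per worker.
    return BloomFilter()


def state_update(state, row):
    # Executed once per row.
    state.add(row)
    return state


def state_merge(left, right):
    # Two filters with identical parameters merge by bitwise OR.
    merged = BloomFilter(left.num_bits, left.num_hashes)
    merged.bits = bytearray(a | b for a, b in zip(left.bits, right.bits))
    return merged


def summarize(rows, num_workers=5):
    """Split rows across workers, build one filter each, then merge them."""
    chunks = [rows[i::num_workers] for i in range(num_workers)]

    def work(chunk):
        state = state_init()
        for row in chunk:
            state = state_update(state, row)
        return state

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        states = list(pool.map(work, chunks))
    return reduce(state_merge, states)
```

Because the merge is a plain bitwise OR, the merged filter is bit-for-bit the same as one built on a single thread, so the number of workers changes only how the work is split, not the result.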