(1) Agreed on supporting n data sources and their lifecycle. I don't
believe we currently manage updates to the Geo enrichments via Ambari,
but I definitely think this solution should handle that in a
datastore-agnostic way as well. The Geo file is loaded into HDFS, which
is a bit different.
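To sketch what "datastore-agnostic" could look like (all names here are hypothetical, not existing Metron classes): the lifecycle operations get defined once, and HBase-backed enrichments and file-style ones like the Geo data in HDFS each implement them.

```python
# Hypothetical sketch only - names are illustrative, not an existing
# Metron API. The point is that load/delete are defined independently
# of where the enrichment data actually lives.
from abc import ABC, abstractmethod


class EnrichmentStore(ABC):
    @abstractmethod
    def load(self, source_uri: str) -> str:
        """Import data, returning a version identifier for the load."""

    @abstractmethod
    def delete(self, version: str) -> None:
        """Remove a previously loaded version."""


class HdfsFileStore(EnrichmentStore):
    """Covers file-style enrichments such as the Geo data in HDFS."""

    def __init__(self):
        self.versions = {}

    def load(self, source_uri: str) -> str:
        version = f"v{len(self.versions) + 1}"
        self.versions[version] = source_uri
        return version

    def delete(self, version: str) -> None:
        self.versions.pop(version, None)


store = HdfsFileStore()
v = store.load("file:///tmp/GeoLite2-City.mmdb")
print(v)  # -> v1
```

An HBase-backed implementation would do its loads the same way from the caller's point of view, which is what would let a single UI manage both kinds of data source.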
(2) Again agreed on not wanting every enrichment in all environments. And
for supporting multiple enrichment types, I do not believe a dropdown in
Ambari is the appropriate choice. That will most definitely not scale, imho.
(3) Yes
(4) Also yes. I'm leaning towards us providing the ability to load
enrichments via a zip bundle. I recommended something similar a while
back for Apache Falcon. It's clean, simple, and allows us to provide
some sort of manifest for defining the import. This also gives us the
option of versioning the manifest format and options down the road.
MPack is not an equivalent mechanism for this.
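To make the bundle idea concrete, here is a rough sketch (the manifest fields and file names are purely illustrative, not a proposed format): a zip containing the data file, the extractor config, and a manifest that tells the loader what to do with them, with a version field so the format can evolve.

```python
# Hypothetical sketch of an enrichment zip bundle. All field names and
# file names are illustrative only.
import io
import json
import zipfile

manifest = {
    "manifestVersion": "1.0",  # lets us evolve the format over time
    "enrichment": "top-domains",
    "dataFile": "top-1m.csv",
    "extractorConfig": "extractor_config.json",
}

# Build a bundle in memory: manifest + data + extractor config.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as bundle:
    bundle.writestr("manifest.json", json.dumps(manifest))
    bundle.writestr("top-1m.csv", "1,google.com\n2,youtube.com\n")
    bundle.writestr("extractor_config.json", "{}")

# A loader would read the manifest first, then dispatch on
# manifestVersion to decide how to interpret the rest of the bundle.
with zipfile.ZipFile(buf) as bundle:
    loaded = json.loads(bundle.read("manifest.json"))
print(loaded["enrichment"])  # -> top-domains
```

A contributor adding a new data source would then only need to produce a bundle like this, without knowing anything about how the loader itself works.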


On Fri, Feb 24, 2017 at 7:30 AM, Nick Allen <n...@nickallen.org> wrote:

> > we now have a robust
> > and flexible means to import enrichment sources and transform their
> > contents as they are inserted into HBase. One of the main motivators for
> > this new functionality was to add the ability to load top domain rankings
> > from sources such as Alexa. The proposal is to make this type of enrichment
> > a top-level feature in Metron by introducing it to the Ambari management UI
>
>
> (1) In thinking through how the UI should work here, we should consider
> data sources beyond just those that would be loaded in HBase.  I would
> think the UI should be a single view of all data sources, no matter whether
> they load into HBase or not.
>
> It would also be good to think through how the solution might handle
> updating other types of data source, like the geo data, for instance. The
> geo data is something that needs to be updated on a regular basis.  Could
> this solution also manage that?
>
> I know Maxmind has a bit of code to manage updating their data, but I am
> not familiar with what it does or how it works.  Researching that might
> help inform this conversation.
>
>
> > How do folks feel about adding a set of dropdown options in the Ambari UI
> > for loading, updating, and deleting the top domains enrichment?
>
>
> (2) I think if this functionality is truly useful, there is likely going to
> be lots of different data sources that would be made available.  Many of
> which will NOT be applicable or desirable in every environment.
>
> This would be akin to packages or RPMs that are available to install on
> CentOS.  There are many to choose from, but in my specific environment
> there are many that I do not care about.
>
> Is an Ambari drop down scalable considering this usage pattern?
>
> > Do we want Ambari to handle only the
> > initial install/load and have end users be responsible on an ongoing basis
> > for updates (users would be responsible for copying or distributing the
> > extractor_config.json for instance), or do we want to enable Ambari to
> > manage the configuration ongoing and enable functionality for reloading,
> > updating, and rollback?
>
>
> (3) Whatever solution we land on, it should handle refreshing/reloading the
> data on a regular basis.  This is something that has to be done for almost
> every useful data source and so should be baked into the solution. I don't
> think the functionality is that useful otherwise.
>
> (4) Another thing to consider is extensibility and ease of use.  If we can
> make it really easy to provide a means for loading a data source into
> Metron, then it is more likely that we will have community members willing
> to do that work.
>
> For example, think about the Homebrew project.  They make it stupid simple
> to add a new installable package.  You don't have to know how Homebrew
> works to contribute a package.  The result is they have tons of packages
> available.
>
> Does the Ambari MPack provide the right level of ease of use for that?
>
>
> On Tue, Feb 21, 2017 at 6:31 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
> > With the work committed in
> > https://github.com/apache/incubator-metron/pull/445 and
> > https://github.com/apache/incubator-metron/pull/432, we now have a robust
> > and flexible means to import enrichment sources and transform their
> > contents as they are inserted into HBase. One of the main motivators for
> > this new functionality was to add the ability to load top domain rankings
> > from sources such as Alexa. The proposal is to make this type of enrichment
> > a top-level feature in Metron by introducing it to the Ambari management UI
> > as a configurable set of properties in the MPack install. This comes with
> > some options and challenges in how we want to manage the configurations,
> > which I will outline below.
> >
> > *Use cases:*
> >
> >    - Single load of top domains file
> >    - Re-loading top domains file - need to be able to cleanup properly
> >    - Cleaning up/deleting old enrichment data (this is a general feature
> >    that we currently lack - I think it is worth a separate Jira/PR for
> >    creating a MapReduce job that enables cleanup to occur).
> >    - Modifying default top domains file source - there are other options
> >    besides Alexa. And users may want to load a file from a local URI since many
> >    data centers do not have direct access to the internet.
> >    - Ability to modify the default extractor config JSON and tune the
> >    Stellar transformations for both the value and indicator transforms. This
> >    allows more flexible handling of data based on other sources.
> >    - Loading multiple top domains source enrichments. (Maybe a separate PR
> >    for this if we even think it would be useful)
> >    - Updating the top domain enrichment - This needs to be an atomic
> >    operation in order to prevent incorrect data.
> >    - Rolling back to an older version of the top domains enrichment. Also
> >    needs to be atomic.
> >    - Ability to run an enrichment load on a schedule - we would like to
> >    defer this to an external scheduling mechanism, e.g. cron or Control M. The
> >    enrichment loading system should have the necessary features to enable this
> >    type of automation without data integrity issues.
> >
> > *Considerations:*
> >
> >    - As mentioned above, we want to add this feature to the Ambari MPack.
> >    This requires at least 2 parameters to work. We need the ability to specify
> >    a URI as well as an extractor config.
> >    - How do we want to manage the extractor config? The most obvious
> >    solution is to provide a text field in Ambari with a default JSON config.
> >    When a load is initiated, Ambari would place a fresh copy of the extractor
> >    config in the /tmp/ directory. This is an ephemeral file that isn't needed
> >    other than during a load.
> >    - It seems easy enough to have the load occur during the initial
> >    install; however, subsequent loads would require a different workflow. How
> >    do folks feel about adding a set of dropdown options in the Ambari UI for
> >    loading, updating, and deleting the top domains enrichment? I believe we
> >    are doing something similar for the ElasticSearch templates currently.
> >    - In the case of atomic operations for updates and rollbacks, I propose
> >    we add a property to Zookeeper that is reference-able in the enrichment
> >    itself. The idea would be to create a "top-domains" property in ZK that
> >    points to an enrichment key with a load timestamp associated with it, e.g.
> >    top-domains_20170221042000. This would also allow a MapReduce job to be
> >    written that cleans up old enrichments. Another option is to create a new
> >    table in HBase if/when you update the enrichment and change the enrichment
> >    config manually. Deleting an old enrichment would simply be a matter of
> >    dropping the table in HBase. A relevant discussion of the tradeoffs of
> >    having many small tables versus 1 large table can be found here -
> >    http://grokbase.com/t/hbase/user/11bjbdw94q/multiple-tables-vs-big-fat-table
> >    - In order to update or roll back an enrichment as mentioned above, we
> >    would also ideally provide a mechanism for changing the rowkey pointed to
> >    by the enrichment.
> >
> > In summary of the use cases and considerations above, this boils down to
> > how we'd like to leverage Ambari here. Do we want Ambari to handle only the
> > initial install/load and have end users be responsible on an ongoing basis
> > for updates (users would be responsible for copying or distributing the
> > extractor_config.json for instance), or do we want to enable Ambari to
> > manage the configuration ongoing and enable functionality for reloading,
> > updating, and rollback?
> >
> > Best,
> > Mike
> >
>
