Assuming you have 9 months of data archived, yes.

On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <[email protected]> wrote:

> So in the case of 3 - if you had 6 months of data that hadn't been profiled and another 3 that had been profiled (9 months total data), in its current form the batch job runs over all 9 months?
>
> On Thu, Sep 20, 2018 at 11:13 AM Nick Allen <[email protected]> wrote:
>
> > > How do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?
> >
> > Good point. Right now, if the Streaming and Batch Profiler overlap the last write wins. And presumably the output of the Streaming and Batch Profiler are the same, so no worries, right? :)
> >
> > So it kind of works, but it is definitely not ideal for use case 3. I could add --begin and --end args to constrain the time frame over which the Batch Profiler runs. I do not have that in the feature branch. It would be easy enough to add though.
> >
> > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <[email protected]> wrote:
> >
> > > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
> > >
> > > 1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
> > >    1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
> > >    2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
> > >    3. Profiles are now up to date from t0 - tCurrent
> > > 2. I've already done #1 above. Time goes by and now I want to add a new profile.
> > >    1. Same first step above
> > >    2. I would run the batch loader with *only* that new profile definition to seed?
> > >
> > > Forgive me if I missed this in PR's and discussion in the FB, but how do we establish "tm" from 1.1 above?
> > > Any concerns about overlap or gaps after the seeding is performed?
> > >
> > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen <[email protected]> wrote:
> > >
> > > > I think more often than not, you would want to load your profile definition from a file. This is why I considered the 'load from Zk' more of a nice-to-have.
> > > >
> > > > - In use case 1 and 2, this would definitely be the case. The profiles I am working with are speculative and I am using the batch profiler to determine if they are worth keeping. In this case, my speculative profiles would not be in Zk (yet).
> > > > - In use case 3, I could see it go either way. It might be useful to load from Zk, but it certainly isn't a blocker.
> > > >
> > > > > So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
> > > >
> > > > You would just get a profile that is slightly different over the entire time span. This is not a new risk. If the user changes their Profile definitions in Zk, the same thing would happen.
> > > >
> > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <[email protected]> wrote:
> > > >
> > > > > I think I'm torn on this, specifically because it's batch and would generally be run as-needed. Justin, can you elaborate on your concerns there? This feels functionally very similar to our flat file loaders, which all have inputs for config from the CLI only. On the other hand, our flat file loaders are not typically seeding an existing structure.
> > > > > My concern of a local file profiler config stems from this stated goal:
> > > > >
> > > > > > The goal would be to enable “profile seeding” which allows profiles to be populated from a time before the profile was created.
> > > > >
> > > > > So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
> > > > >
> > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet <[email protected]> wrote:
> > > > >
> > > > > > The profile not being able to read from ZK feels like a fairly substantial, if subtle, set of potential problems. I'd like to see that in either before merging or at least pretty soon after merging. Is it a lot of work to add that functionality based on where things are right now?
> > > > > >
> > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen <[email protected]> wrote:
> > > > > >
> > > > > > > Here is another limitation that I just thought. It can only read a profile definition from a file. It probably also makes sense to add an option that allows it to read the current Profiler configuration from Zookeeper.
> > > > > > >
> > > > > > > > Is it worth setting up a default config that pulls from the main indexing output?
> > > > > > >
> > > > > > > Yes, I think that makes sense. We want the Batch Profiler to point to the right HDFS URL, no matter where/how Metron is deployed. When Metron gets spun-up on a cluster, I should be able to just run the Batch Profiler without having to fuss with the input path.
> > > > > > >
> > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet <[email protected]> wrote:
> > > > > > >
> > > > > > > > Re:
> > > > > > > >
> > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.
> > > > > > > >
> > > > > > > > Is it worth setting up a default config that pulls from the main indexing output? I'm a little on the fence about it, but it seems like making the most common case more or less built-in would be nice.
> > > > > > > >
> > > > > > > > Having said that, I do not consider that a requirement for merging the feature branch.
> > > > > > > >
> > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > I think what you have outlined above is a good initial stab at the feature. Manual install of spark is not a big deal. Configuring via command line while we mature this feature is ok as well. Doesn't look like configuration steps are too hard. I think you should merge.
> > > > > > > > >
> > > > > > > > > James
> > > > > > > > >
> > > > > > > > > 19.09.2018, 08:15, "Nick Allen" <[email protected]>:
> > > > > > > > > > I would like to open a discussion to get the Batch Profiler feature branch merged into master as part of METRON-1699 [1] Create Batch Profiler. All of the work that I had in mind for our first draft of the Batch Profiler has been completed.
> > > > > > > > > > Please take a look through what I have and let me know if there are other features that you think are required *before* we merge.
> > > > > > > > > >
> > > > > > > > > > Previous list discussions on this topic include [2] and [3].
> > > > > > > > > >
> > > > > > > > > > (Q) What can I do with the feature branch?
> > > > > > > > > >
> > > > > > > > > > * With the Batch Profiler, you can backfill/seed profiles using archived telemetry. This enables the following types of use cases.
> > > > > > > > > >
> > > > > > > > > >   1. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have created a feature set that has predictive value for model building.
> > > > > > > > > >
> > > > > > > > > >   2. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have defined the profile correctly and created a feature set that matches reality.
> > > > > > > > > >
> > > > > > > > > >   3. As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
> > > > > > > > > >
> > > > > > > > > > * METRON-1699 [1] includes a more detailed description of the feature.
> > > > > > > > > >
> > > > > > > > > > (Q) What work was completed?
> > > > > > > > > >
> > > > > > > > > > * The Batch Profiler runs on Spark and was implemented in Java to remain consistent with our current Java-heavy code base.
> > > > > > > > > >
> > > > > > > > > > * The Batch Profiler is executed from the command-line. It can be launched using a script or by calling `spark-submit`, which may be useful for advanced users.
> > > > > > > > > >
> > > > > > > > > > * Input telemetry can be consumed from multiple sources; for example HDFS or the local file system.
> > > > > > > > > >
> > > > > > > > > > * Input telemetry can be consumed in multiple formats; for example JSON or ORC.
> > > > > > > > > >
> > > > > > > > > > * The 'output' profile measurements are persisted in HBase and is consistent with the Storm Profiler.
> > > > > > > > > >
> > > > > > > > > > * It can be run on any underlying engine supported by Spark. I have tested it both in 'local' mode and on a YARN cluster.
> > > > > > > > > >
> > > > > > > > > > * It is installed automatically by the Metron MPack.
> > > > > > > > > >
> > > > > > > > > > * A README was added that documents usage instructions.
> > > > > > > > > >
> > > > > > > > > > * The existing Profiler code was refactored so that as much code as possible is shared between the 3 Profiler ports; Storm, the Stellar REPL, and Spark. For example, the logic which determines the timestamp of a message was refactored so that it could be reused by all ports.
> > > > > > > > > >
> > > > > > > > > >   * metron-profiler-common: The common Profiler code shared amongst each port.
> > > > > > > > > >   * metron-profiler-storm: Profiler on Storm
> > > > > > > > > >   * metron-profiler-spark: Profiler on Spark
> > > > > > > > > >   * metron-profiler-repl: Profiler on the Stellar REPL
> > > > > > > > > >   * metron-profiler-client: The client code for retrieving profile data; for example PROFILE_GET.
> > > > > > > > > >
> > > > > > > > > > * There are 3 separate RPM and DEB packages now created for the Profiler.
> > > > > > > > > >
> > > > > > > > > >   * metron-profiler-storm-*.rpm
> > > > > > > > > >   * metron-profiler-spark-*.rpm
> > > > > > > > > >   * metron-profiler-repl-*.rpm
> > > > > > > > > >
> > > > > > > > > > * The Profiler integration tests were enhanced to leverage the Profiler Client logic to validate the results.
> > > > > > > > > >
> > > > > > > > > > * Review METRON-1699 [1] for a complete break-down of the tasks that have been completed on the feature branch.
> > > > > > > > > >
> > > > > > > > > > (Q) What limitations exist?
> > > > > > > > > >
> > > > > > > > > > * You must manually install Spark to use the Batch Profiler. The Metron MPack does not treat Spark as a Metron dependency and so does not install it automatically.
> > > > > > > > > >
> > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.
> > > > > > > > > >
> > > > > > > > > > * To run the Batch Profiler in 'Full Dev', you have to take the following manual steps.
> > > > > > > > > >   Some of these are arguably limitations with how Ambari installs Spark 2 in the version of HDP that we run.
> > > > > > > > > >
> > > > > > > > > >   1. Install Spark 2 using Ambari.
> > > > > > > > > >
> > > > > > > > > >   2. Tell Spark how to talk with HBase.
> > > > > > > > > >
> > > > > > > > > >        SPARK_HOME=/usr/hdp/current/spark2-client
> > > > > > > > > >        cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
> > > > > > > > > >
> > > > > > > > > >   3. Create the Spark History directory in HDFS.
> > > > > > > > > >
> > > > > > > > > >        export HADOOP_USER_NAME=hdfs
> > > > > > > > > >        hdfs dfs -mkdir /spark2-history
> > > > > > > > > >
> > > > > > > > > >   4. Change the default input path to `hdfs://localhost:8020/...` to match the port defined by HDP, instead of port 9000.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/METRON-1699
> > > > > > > > > > [2] https://lists.apache.org/thread.html/da81c1227ffda3a47eb2e5bb4d0b162dd6d36006241c4ba4b659587b@%3Cdev.metron.apache.org%3E
> > > > > > > > > > [3] https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4ee601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E
> > > > > > > > >
> > > > > > > > > -------------------
> > > > > > > > > Thank you,
> > > > > > > > >
> > > > > > > > > James Sirota
> > > > > > > > > PMC- Apache Metron
> > > > > > > > > jsirota AT apache DOT org
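
A note on the overlap question raised in this thread (how to establish "tm" and what happens when the Batch and Streaming Profilers cover the same time span): the sketch below is only an illustration and is not code from the feature branch. It assumes a hypothetical end bound, along the lines of the proposed --end argument that does not exist in the feature branch, plus an assumed 15-minute profile period and a simple count profile. The names `PERIOD_MS`, `in_window`, and `seed_counts` are invented for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

# Assumption: a 15-minute profile period, expressed in milliseconds.
PERIOD_MS = 15 * 60 * 1000


def in_window(timestamp_ms: int,
              begin_ms: Optional[int] = None,
              end_ms: Optional[int] = None) -> bool:
    """Return True if begin_ms <= timestamp_ms < end_ms.

    A bound of None is treated as open on that side.
    """
    if begin_ms is not None and timestamp_ms < begin_ms:
        return False
    if end_ms is not None and timestamp_ms >= end_ms:
        return False
    return True


def seed_counts(messages: Iterable[Tuple[str, int]],
                end_ms: int) -> Dict[Tuple[str, int], int]:
    """Count messages per (entity, period index), using only messages
    that arrived before end_ms (the hypothetical "tm")."""
    counts: Dict[Tuple[str, int], int] = {}
    for entity, timestamp_ms in messages:
        if in_window(timestamp_ms, end_ms=end_ms):
            key = (entity, timestamp_ms // PERIOD_MS)
            counts[key] = counts.get(key, 0) + 1
    return counts


# Archived telemetry as (entity, timestamp in ms) pairs.
archived = [
    ("10.0.0.1", 0),
    ("10.0.0.1", 1000),
    ("10.0.0.1", PERIOD_MS),
    ("10.0.0.1", 2 * PERIOD_MS),  # at tm; left to the streaming side
]
tm = 2 * PERIOD_MS
batch = seed_counts(archived, end_ms=tm)

# "Last write wins": whichever profiler writes a key last replaces the
# earlier value. With the window bounded at tm, the keys do not collide.
streaming = {("10.0.0.1", 2): 1}
store = dict(streaming)
store.update(batch)
```

If tm does not fall on a period boundary, the period containing tm would be only partially covered by each side, so rounding tm down to a period boundary before seeding is probably the safer choice.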
