[GitHub] incubator-metron issue #414: METRON-532 Define Profile Period When Calling P...

2017-01-12 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/incubator-metron/pull/414
  
Couple additional fixes to the documentation.  Statements about storage of 
the writer and client configs were incorrect.




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Kyle Richardson
I'll second the preference for the first option. I think the ability to use
Stellar filters to customize indexing would be a big win.

I'm glad Matt brought up the point about data lake and CEP. I think this is
a really important use case that we need to consider. Take a simple
example... If I have data coming in from 3 different firewall vendors and 2
different web proxy/url filtering vendors and I want to be able to analyze
that data set, I need the data to be indexed all together (likely in HDFS)
and to have a normalized schema such that IP address, URL, and user name
(to take a few) can be easily queried and aggregated. I can also envision
scenarios where I would want to index data based on attributes other than
sensor: business unit or subsidiary, for example.

I've been wanting to propose extending our 7 standard fields to include
things like URL and user. Is there community interest/support for moving in
that direction? If so, I'll start a new thread.

Thanks!

-Kyle
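
For concreteness, a short Python sketch of what a per-sensor indexing config
with per-writer toggles and a Stellar filter (the "first option" above) might
look like. All field names here ("enabled", "filter", the writer keys) are
assumptions for illustration, not the committed format:

import json

# Hypothetical indexing config for a "squid" sensor; the dict mirrors the
# JSON that would be pushed to ZooKeeper.
squid_indexing = {
    "elasticsearch": {
        "index": "squid",
        "batchSize": 5,
        "enabled": True,
        # assumed Stellar predicate deciding which messages get indexed
        "filter": "exists(ip_dst_addr) and ip_dst_addr != '127.0.0.1'"
    },
    "hdfs": {
        "index": "squid",
        "batchSize": 25,
        "enabled": True    # keep everything in the data lake
    },
    "solr": {
        "enabled": False   # this writer turned off entirely
    }
}

print(json.dumps(squid_indexing, indent=2))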

On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:

> Ah, I see.  If overriding the default index name allows using the same
> name for multiple sensors, then the goal can be achieved.
> Thanks,
> --Matt
>
>
> On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
>
> Oh, you could!  Let's say you have a syslog parser with data from
> sources 1,
> 2 and 3.  You'd end up with one kafka queue with 3 parsers attached to
> that
> queue, each picking out the messages from source 1, 2 and 3.  They'd
> go
> through separate enrichment and into the indexing topology.  In the
> indexing topology, you could specify the same index name "syslog" and
> all
> of the messages go into the same index for CEP querying if so desired.
>
> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley  wrote:
>
> > Syslog is hell on parsers – I know, I worked at LogLogic in a
> previous
> > life.  It makes perfect sense to route different lines from syslog
> through
> > different appropriate parsers.  But a lot of what the parsers do is
> > identify consistent subsets of metadata and annotate it – eg,
> src_ip_addr,
> > event timestamps, etc.  Once those metadata are annotated and
> available
> > with common field names, why doesn’t it make sense to index the
> messages
> > together, for CEP querying?  I think Splunk has illustrated this
> model.
> >
> > On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
> >
> > yeah, I mean, honestly, I think the approach that we've taken for
> > sources
> > which aggregate different types of data is to provide filters at
> the
> > parser
> > level and have multiple parser topologies (with different,
> possibly
> > mutually exclusive filters) running.  This would be a completely
> > separate
> > sensor.  Imagine a syslog data source that aggregates and you
> want to
> > pick
> > apart certain pieces of messages.  This is why the initial
> thought and
> > architecture was one index per sensor.
> >
> > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley 
> wrote:
> >
> > > I’m thinking that CEP (Complex Event Processing) is contrary
> to the
> > idea
> > > of silo-ing data per sensor.
> > > Now it’s true that some of those sensors are already
> aggregating
> > data from
> > > multiple sources, so maybe I’m wrong here.
> > > But it just seems to me that the “data lake” insights come from
> > being able
> > > to make decisions over the whole mass of data rather than just
> > vertical
> > > slices of it.
> > >
> > > On 1/12/17, 2:15 PM, "Casey Stella" 
> wrote:
> > >
> > > Hey Matt,
> > >
> > > Thanks for the comment!
> > > 1. At the moment, we only have one index name, the default
> of
> > which is
> > > the
> > > sensor name but that's entirely up to the user.  This is
> sensor
> > > specific,
> > > so it'd be a separate config for each sensor.  If we want
> to
> > build
> > > multiple
> > > indices per sensor, we'd have to think carefully about how
> to do
> > that
> > > and
> > > would be a bigger undertaking.  I guess I can see the use,
> though
> > > (redirect
> > > messages to one index vs another based on a predicate for
> a given
> > > sensor).
> > > Anyway, not where I was originally thinking that this
> discussion
> > would
> > > go,
> > > but it's an interesting point.
> > >
> > > 2. I hadn't thought through the implementation quite yet,
> but we
> > don't
> > > actually have a splitter bolt in that topology, just a
> spout
> > that goes
> > > to
> > > the elasticsearch writer and also to the hdfs writer.
> > >
> > > On Thu, Jan 12, 2017 a

Re: [PROPOSAL] up-to-date versioned documentation

2017-01-12 Thread Matt Foley
The Spark docs sure are pretty.  I suspect there’s a lot of person-weeks of 
work behind the content.  I don’t know how hard it was to set up the 
infrastructure, but the instructions for generating the site mention an 
impressive list of tools needed.

The Falcon docs site seems much more straightforward, and reasonably pretty 
too.  I can take a little time to understand it better.

Thanks,
--Matt


On 1/12/17, 6:19 PM, "Kyle Richardson"  wrote:

Matt, thanks for pulling this together. I completely agree that we need to
go all in on either cwiki or the README.md's. I think the wiki is poorly
updated and can cause confusion for new users and devs. My preference is
certainly for the README.md's.

I like your approach but also agree that we shouldn't need to roll our own
here. I really like the Spark documentation that Mike pointed out. Any way
we can duplicate/adapt their approach?

-Kyle

On Thu, Jan 12, 2017 at 7:19 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Casey, Matt - These guys are using doxia
> https://github.com/apache/falcon/tree/master/docs
>
> Honestly, I kind of like Spark's approach -
> https://github.com/apache/spark/tree/master/docs
>
> Mike
>
> On Thu, Jan 12, 2017 at 4:48 PM, Matt Foley  wrote:
>
> > I’m ambivalent; I think we’d end up tied to the doxia processing
> pipeline,
> > which is “yet another arcane toolset” to learn.  Using .md as the input
> > format decreases the dependency, but we’d still be dependent on it.
> >
> > I had anticipated that the web page would be a write-once thing that
> would
> > be only a couple days for an experienced Web developer. But I was going
> to
> > get an estimate from some co-workers before actually trying to get it
> > implemented. And the script is a few hours of work with find and awk.
> >
> > On the other hand, doxia is certainly an expectable solution.  Is 
setting
> > up that infrastructure less work than developing the web page?  Or is it
> > actually just a matter of a few lines in pom.xml?
> >
> >
> > On 1/12/17, 3:24 PM, "Casey Stella"  wrote:
> >
> > Just a followup thought that's a bit more constructive, maybe we
> could
> > migrate the README.md's into a site directory and use doxia markdown
> > (example here ) to
> > generate the site as part of the build to resolve 1 through 3?
> >
> > On Thu, Jan 12, 2017 at 6:19 PM, Casey Stella 
> > wrote:
> >
> > > So, I do think this would be better than what we currently do.  I
> > like a
> > > few things in particular:
> > >
> > >- I don't like the wiki one bit.
> > >- We have a LOT of documentation in the README.md's and it's
> > sometimes
> > >poorly organized
> > >- I like a documentation preprocessing pipeline to be present.
> > For
> > >instance, a major ask is all of the stellar functions in one
> > place.  That's
> > >solved by updating an index manually in the READMEs and keeping
> > it in sync
> > >with the annotation.  I'd like to make a stellar annotation ->
> > markdown
> > >generator as part of the build and this would be nice for such 
a
> > task.
> > >
> > > My only concern is that the html generation/viewer seems like a
> fair
> > > amount of engineering.  Are you sure there isn't something easier
> > that we
> > > could conform to?  I'm sure we aren't the only project in the 
world
> > that
> > > has this particular issue.  Is there something like a maven site
> > plugin or
> > > something?  Just a thought.  I'll come back with more :)
> > >
> > > Great ideas!  Keep them coming!
> > >
> > > Casey
> > >
> > > On Thu, Jan 12, 2017 at 6:05 PM, Matt Foley 
> > wrote:
> > >
> > >> We currently have three forms of documentation, with the 
following
> > >> advantages and disadvantages:
> > >>
> > >> || Docs || Pro || Con ||
> > >> | CWiki |
> > >>   Easy to edit, no special tools required, don't have to be a
> > >> developer to contribute, google and wiki search |
> > >> Not versioned, no review process, distant from the code, obsolete
> > content
> > >> tends to accumulate |
> > >> | Site |
> > >>   Versioned and reviewed, only committers can edit, google
> > search |
> > >>   Yet another arcane toolset must be learned, only web
> > programmers
> > >> feel comfortable contributing, "asf-site" branch not related to
> code
> > >> versions, distant from the code, tends to go obsolete due to

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-12 Thread Matt Foley
Mike, could you try again on the image, please, making sure it is a simple 
format (gif, png, or jpeg)?  It got munched, at least in my viewer.  Thanks.

Casey, responding to some of the questions you raised:

I’m going to make a rather strong statement:  We already have a service “to 
intermediate and handle config update/retrieval”.  
Furthermore, it:
- Correctly handles the problems of distributed services running on multi-node 
clusters.  (That’s a HARD problem, people, and we shouldn’t try to reinvent the 
wheel.)
- Correctly handles Kerberos security. (That’s kinda hard too, or at least a 
lot of work.)
- Automatically versions configurations, and allows viewing, comparing, and 
reverting historical configs.
- Has a capable REST API for all those things.
It doesn’t natively integrate Zookeeper storage of configs, but there is a 
natural place to specify copying to/from Zookeeper for the desired files.

It is Ambari.  And we should commit to it, rather than try to re-create such 
features.
Because it has a good REST API, it is perfectly feasible to implement Stellar 
functions that call it.
GUI configuration tools can also use the Ambari APIs, or better yet be 
integrated in an “Ambari View”. (Eg, see the “Yarn Capacity Scheduler 
Configuration Tool” example in the Ambari documentation, under “Using Ambari 
Views”.)

Arguments are: Parsimony, Sufficiency, Not reinventing the wheel, and Not 
spending weeks and weeks of developer time over the next year reinventing the 
wheel while getting details wrong multiple times…

Okay, off soapbox.  

Casey asked what the config update behavior of Ambari is, and how it will 
interact with changes made from outside Ambari.
The following is from my experience working with the Ambari Mpack for Metron.  
I am not otherwise an Ambari expert, so tomorrow I’ll get it reviewed by an 
Ambari development engineer.

Ambari-server runs on one node, and Ambari-agent runs on each node.
Ambari-server has a private set of py, xml, and template files, which together 
are used both to generate the Ambari configuration GUI, with defaults, and to 
generate configuration files (of any needed filetype) for the various Stack 
components.
Ambari-server also has a database where it stores the schema related to these 
files, so even if you reach in and edit Ambari’s files, it will error out if 
the set of parameters or parameter names changes.  The historical information 
about configuration changes is also stored in the db.
For each component (and in the case of Metron, for each topology), there is a 
python file which controls the logic for these actions, among others:
- Install
- Start / stop / restart / status
- Configure

It is actually up to this python code (which we wrote for the Metron Mpack) 
what happens in each of these API calls.  But the current code, and I believe 
this is typical of Ambari-managed components, performs a “Configure” action 
whenever you press the “Save” button after changing a component config in 
Ambari, and also on each Install and Start or Restart.

The Configure action consists of approximately the following sequence (see 
disclaimer above :-)
- Recreate the generated config files, using the template files and the actual 
configuration most recently set in Ambari
  - Note this is also under the control of python code that we wrote, and this is 
the appropriate place to push to ZK if desired.
- Propagate those config files to each Ambari-agent, with a command to set them 
locally
- The ambari-agents on each node receive the files and write them to the 
specified locations on local storage

Ambari-server then whines that the updated services should be restarted, but 
does not initiate that action itself (unless of course the initiating action 
was a Start command from the administrator).

Make sense?  It’s all quite straightforward in concept; there’s just an awful 
lot of stuff wrapped around that to make it all go smoothly and handle the 
problems when it doesn’t.

There’s additional complexity in that the Ambari-agent also caches (on each 
node) both the template files and COMPILED forms of the python files (.pyc) 
involved in transforming them.  The pyc files incorporate some amount of 
additional info regarding parameter values, but I’m not sure of the form.  I 
don’t think that changes the above in any practical way unless you’re trying to 
cheat Ambari by reaching in and editing its files directly.  In that case, you 
also need to whack the pyc files (on each node) to force the data to be 
reloaded from Ambari-server.  Best solution is don’t cheat.

Also, there may be circumstances under which the Ambari-agent will detect 
changes and re-write the latest version it knows of the config files, even 
without a Save or Start action at the Ambari-server.  I’m not sure of this and 
need to check with Ambari developers.  It may no longer happen, although I’m 
pretty sure change detection/reversion was a feature of early versions of 
Ambari.
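
To make the sequence above concrete, here is a schematic of a Configure action
in Python. This is not the actual Metron mpack code; the template name, the
METRON_HOME layout, and the zk_load_configs.sh flags are placeholders:

import json
import os
import subprocess

METRON_HOME = os.environ.get("METRON_HOME", "/usr/metron/current")   # assumed layout

def render_template(name, params):
    # Stand-in for Ambari's template engine: a real mpack resolves the .j2
    # template against the parameters most recently saved in Ambari.
    return json.dumps(params, indent=2)

def configure(params, push_to_zk=True):
    # 1. Recreate the generated config files from the templates plus the
    #    configuration most recently saved in Ambari.
    zk_config_dir = os.path.join(METRON_HOME, "config", "zookeeper")
    os.makedirs(zk_config_dir, exist_ok=True)
    with open(os.path.join(zk_config_dir, "global.json"), "w") as out:
        out.write(render_template("global.json.j2", params))

    # 2. The natural place to push the regenerated configs to ZooKeeper,
    #    e.g. by shelling out to the existing load script (flags assumed).
    if push_to_zk:
        subprocess.check_call([os.path.join(METRON_HOME, "bin", "zk_load_configs.sh"),
                               "-m", "PUSH", "-i", zk_config_dir,
                               "-z", params["zookeeper_quorum"]])

    # 3. The agents then write the distributed files locally on each node, and
    #    Ambari-server prompts for a restart of the affected services.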

Re: [PROPOSAL] up-to-date versioned documentation

2017-01-12 Thread Kyle Richardson
Matt, thanks for pulling this together. I completely agree that we need to
go all in on either cwiki or the README.md's. I think the wiki is poorly
updated and can cause confusion for new users and devs. My preference is
certainly for the README.md's.

I like your approach but also agree that we shouldn't need to roll our own
here. I really like the Spark documentation that Mike pointed out. Any way
we can duplicate/adapt their approach?

-Kyle

On Thu, Jan 12, 2017 at 7:19 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Casey, Matt - These guys are using doxia
> https://github.com/apache/falcon/tree/master/docs
>
> Honestly, I kind of like Spark's approach -
> https://github.com/apache/spark/tree/master/docs
>
> Mike
>
> On Thu, Jan 12, 2017 at 4:48 PM, Matt Foley  wrote:
>
> > I’m ambivalent; I think we’d end up tied to the doxia processing
> pipeline,
> > which is “yet another arcane toolset” to learn.  Using .md as the input
> > format decreases the dependency, but we’d still be dependent on it.
> >
> > I had anticipated that the web page would be a write-once thing that
> would
> > be only a couple days for an experienced Web developer. But I was going
> to
> > get an estimate from some co-workers before actually trying to get it
> > implemented. And the script is a few hours of work with find and awk.
> >
> > On the other hand, doxia is certainly an expectable solution.  Is setting
> > up that infrastructure less work than developing the web page?  Or is it
> > actually just a matter of a few lines in pom.xml?
> >
> >
> > On 1/12/17, 3:24 PM, "Casey Stella"  wrote:
> >
> > Just a followup thought that's a bit more constructive, maybe we
> could
> > migrate the README.md's into a site directory and use doxia markdown
> > (example here ) to
> > generate the site as part of the build to resolve 1 through 3?
> >
> > On Thu, Jan 12, 2017 at 6:19 PM, Casey Stella 
> > wrote:
> >
> > > So, I do think this would be better than what we currently do.  I
> > like a
> > > few things in particular:
> > >
> > >- I don't like the wiki one bit.
> > >- We have a LOT of documentation in the README.md's and it's
> > sometimes
> > >poorly organized
> > >- I like a documentation preprocessing pipeline to be present.
> > For
> > >instance, a major ask is all of the stellar functions in one
> > place.  That's
> > >solved by updating an index manually in the READMEs and keeping
> > it in sync
> > >with the annotation.  I'd like to make a stellar annotation ->
> > markdown
> > >generator as part of the build and this would be nice for such a
> > task.
> > >
> > > My only concern is that the html generation/viewer seems like a
> fair
> > > amount of engineering.  Are you sure there isn't something easier
> > that we
> > > could conform to?  I'm sure we aren't the only project in the world
> > that
> > > has this particular issue.  Is there something like a maven site
> > plugin or
> > > something?  Just a thought.  I'll come back with more :)
> > >
> > > Great ideas!  Keep them coming!
> > >
> > > Casey
> > >
> > > On Thu, Jan 12, 2017 at 6:05 PM, Matt Foley 
> > wrote:
> > >
> > >> We currently have three forms of documentation, with the following
> > >> advantages and disadvantages:
> > >>
> > >> || Docs || Pro || Con ||
> > >> | CWiki |
> > >>   Easy to edit, no special tools required, don't have to be a
> > >> developer to contribute, google and wiki search |
> > >> Not versioned, no review process, distant from the code, obsolete
> > content
> > >> tends to accumulate |
> > >> | Site |
> > >>   Versioned and reviewed, only committers can edit, google
> > search |
> > >>   Yet another arcane toolset must be learned, only web
> > programmers
> > >> feel comfortable contributing, "asf-site" branch not related to
> code
> > >> versions, distant from the code, tends to go obsolete due to
> > >> non-maintenance |
> > >> | README.md |
> > >>   Versioned and reviewed, only committers can edit, tied to
> code
> > >> versions, highly local to the code being documented |
> > >>   Non-developers don't know about them, may be scared by
> > github, poor
> > >> scoring in google search, no high-level presentation |
> > >>
> > >> Various discussion threads indicate the developer community likes
> > >> README-based docs, and it's easy to see why from the above.  I
> > propose this
> > >> extension to the README-based documentation, to address their
> > disadvantages:
> > >>
> > >> 1. Produce a script that gathers the README.md files from all code
> > >> subdirectories into a hierarchical list.  The script would have an
> > >> exclusion list for non-user-content, which at this point would
> > consist 

[GitHub] incubator-metron pull request #409: METRON-644 RPM builds only work with Doc...

2017-01-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-metron/pull/409




[GitHub] incubator-metron issue #394: METRON-623: Management UI

2017-01-12 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/incubator-metron/pull/394
  
Wow, @merrimanr, there are currently 320 files changed in this PR.  I 
don't know how to start reviewing this.

Hopefully most or a lot of those go away when METRON-622 and METRON-503 are 
subtracted.  Could you please consider posting a PR against a branch that 
already has those committed, so we can see just the changes affected by this 
feature?  Alternatively you could use [Apache 
Reviewboard](https://reviews.apache.org/r/) to post such a diff.  Thanks.




Re: [PROPOSAL] up-to-date versioned documentation

2017-01-12 Thread Michael Miklavcic
Casey, Matt - These guys are using doxia
https://github.com/apache/falcon/tree/master/docs

Honestly, I kind of like Spark's approach -
https://github.com/apache/spark/tree/master/docs

Mike
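
On the "stellar annotation -> markdown generator" Casey mentions in the quoted
thread below, a rough Python sketch of a first pass. It assumes the @Stellar
annotation carries namespace/name/description fields and simply scrapes the
Java sources; a real version would run inside the build (annotation processor
or classpath scan) rather than over source text:

import os
import re

# Pull quoted fields out of the text that follows an @Stellar( occurrence.
FIELD = re.compile(r'(namespace|name|description)\s*=\s*"([^"]*)"')

def stellar_functions(source_root):
    for dirpath, _, files in os.walk(source_root):
        for fname in files:
            if not fname.endswith(".java"):
                continue
            with open(os.path.join(dirpath, fname), encoding="utf-8") as f:
                text = f.read()
            for match in re.finditer(r'@Stellar\s*\(', text):
                # Crude fixed-size window after the annotation start; good
                # enough for a sketch, not for production.
                fields = dict(FIELD.findall(text[match.end():match.end() + 800]))
                if "name" in fields:
                    yield fields

def markdown_index(source_root):
    lines = ["## Stellar Language Functions", ""]
    for fn in sorted(stellar_functions(source_root), key=lambda f: f["name"]):
        full = "%s_%s" % (fn["namespace"], fn["name"]) if fn.get("namespace") else fn["name"]
        lines.append("* `%s`: %s" % (full, fn.get("description", "")))
    return "\n".join(lines)

if __name__ == "__main__":
    print(markdown_index("metron-platform"))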

On Thu, Jan 12, 2017 at 4:48 PM, Matt Foley  wrote:

> I’m ambivalent; I think we’d end up tied to the doxia processing pipeline,
> which is “yet another arcane toolset” to learn.  Using .md as the input
> format decreases the dependency, but we’d still be dependent on it.
>
> I had anticipated that the web page would be a write-once thing that would
> be only a couple days for an experienced Web developer. But I was going to
> get an estimate from some co-workers before actually trying to get it
> implemented. And the script is a few hours of work with find and awk.
>
> On the other hand, doxia is certainly an expectable solution.  Is setting
> up that infrastructure less work than developing the web page?  Or is it
> actually just a matter of a few lines in pom.xml?
>
>
> On 1/12/17, 3:24 PM, "Casey Stella"  wrote:
>
> Just a followup thought that's a bit more constructive, maybe we could
> migrate the README.md's into a site directory and use doxia markdown
> (example here ) to
> generate the site as part of the build to resolve 1 through 3?
>
> On Thu, Jan 12, 2017 at 6:19 PM, Casey Stella 
> wrote:
>
> > So, I do think this would be better than what we currently do.  I
> like a
> > few things in particular:
> >
> >- I don't like the wiki one bit.
> >- We have a LOT of documentation in the README.md's and it's
> sometimes
> >poorly organized
> >- I like a documentation preprocessing pipeline to be present.
> For
> >instance, a major ask is all of the stellar functions in one
> place.  That's
> >solved by updating an index manually in the READMEs and keeping
> it in sync
> >with the annotation.  I'd like to make a stellar annotation ->
> markdown
> >generator as part of the build and this would be nice for such a
> task.
> >
> > My only concern is that the html generation/viewer seems like a fair
> > amount of engineering.  Are you sure there isn't something easier
> that we
> > could conform to?  I'm sure we aren't the only project in the world
> that
> > has this particular issue.  Is there something like a maven site
> plugin or
> > something?  Just a thought.  I'll come back with more :)
> >
> > Great ideas!  Keep them coming!
> >
> > Casey
> >
> > On Thu, Jan 12, 2017 at 6:05 PM, Matt Foley 
> wrote:
> >
> >> We currently have three forms of documentation, with the following
> >> advantages and disadvantages:
> >>
> >> || Docs || Pro || Con ||
> >> | CWiki |
> >>   Easy to edit, no special tools required, don't have to be a
> >> developer to contribute, google and wiki search |
> >> Not versioned, no review process, distant from the code, obsolete
> content
> >> tends to accumulate |
> >> | Site |
> >>   Versioned and reviewed, only committers can edit, google
> search |
> >>   Yet another arcane toolset must be learned, only web
> programmers
> >> feel comfortable contributing, "asf-site" branch not related to code
> >> versions, distant from the code, tends to go obsolete due to
> >> non-maintenance |
> >> | README.md |
> >>   Versioned and reviewed, only committers can edit, tied to code
> >> versions, highly local to the code being documented |
> >>   Non-developers don't know about them, may be scared by
> github, poor
> >> scoring in google search, no high-level presentation |
> >>
> >> Various discussion threads indicate the developer community likes
> >> README-based docs, and it's easy to see why from the above.  I
> propose this
> >> extension to the README-based documentation, to address their
> disadvantages:
> >>
> >> 1. Produce a script that gathers the README.md files from all code
> >> subdirectories into a hierarchical list.  The script would have an
> >> exclusion list for non-user-content, which at this point would
> consist of
> >> [site/*, build_utils/*].  The hierarchy would be sorted
> depth-first.  The
> >> resulting hierarchical list at this time (with six added README
> files to
> >> complete the hierarchy) would be:
> >>
> >> ./README.md
> >> ./metron-analytics/README.md  <== (need file here)
> >> ./metron-analytics/metron-maas-service/README.md
> >> ./metron-analytics/metron-profiler/README.md
> >> ./metron-analytics/metron-profiler-client/README.md
> >> ./metron-analytics/metron-statistics/README.md
> >> ./metron-deployment/README.md
> >> ./metron-deployment/amazon-ec2/README.md
> >> ./metron-deployment/packaging/README.md  <== (need file here)
> >> ./metron-deployment/packaging/ambari/README.md <== (nee

Re: [DISCUSS] Dev Guide and Committer Review Guide additions?

2017-01-12 Thread Michael Miklavcic
"Also, what would people think of dropping Ansible in favor of Ambari and
Docker as the preferred deployment management approaches?"

Agreed about publishing via Ambari. I'm not sure about fully replacing
Vagrant just yet, but we could move that direction. Docker would allow us
to more easily test a realistic multi-node setup on a single machine. In
the meantime, maybe a quick win could be to use Ansible to deploy and
install the MPack to the quickdev environment? This way we're leveraging
the rpm's as well as the MPack code and installing in nearly the same
manner as most users.

On Thu, Jan 12, 2017 at 3:49 PM, Matt Foley  wrote:

> I think I hear 3 major areas not adequately covered by our usual “code
> review”:
> 1. Documentation
> 2. Deployment Builds
> 3. Management of config parameters
>
> The other areas mentioned by Otto (testing, perf test, Stellar impact, and
> REST api impact), are entirely valid, but fall under existing code and
> architecture that seems generally adequate.
>
> Regarding #1, Documentation, I’d like to branch a discussion thread for a
> proposal I’m about to make, to enhance our use of README files as usable
> and up-to-date end-user documentation, linked from the Metron site.
> Implicit in that is the idea that we’d deprecate using the cwiki for
> anything but long-lived demonstrations/tutorials that are unlikely to go
> obsolete.
>
> For #2, Deployment Builds:  This is difficult, and unfortunately I’m not
> an expert with these things, but we need to automate this as much as
> possible.  Config params will always interact heavily with deployment
> issues, but let’s leave that for #3 :0)
>
> As far as RPMs, Ansible playbooks, or Docker images go, we’d like to
> automate so that developers never have to do anything when they are
> committing modifications of existing components, and even when new
> components are added (like the Profiler is being added now), it should
> insofar as possible be automated via maven declarations.  But that takes
> input from the experts in each of the areas.
>
> Also, what would people think of dropping Ansible in favor of Ambari and
> Docker as the preferred deployment management approaches?
>
> #3, Management of config parameters:  I’ve been thinking about this
> lately, but haven’t written up a proposal yet.  I’m bothered by the wide
> ranging variability in the way Metron configs are managed: files,
> zookeeper, environment variables, traditional Hadoop-style configs, and
> roll-your-own json configs, sometimes shared, sometimes duplicated, not to
> mention Ambari over it all.  This has been encouraged by the huge number of
> Stack components that Metron depends on, and the relative independence of
> the components Metron itself is composed of.
>
> But I think as Otto points out, as we grow the number of components and
> mature out of the incubator, we have to get this under control.  We need an
> architecture for management of configuration parameters of the Metron
> topologies.  (We can’t do much about the Stack components, but Ambari is
> establishing a culture around managing those.)  The architecture needs to
> include update methodology for semantic changes in parameter sets.
>
> I’m mulling such an architecture, but what do other people think?  Is this
> a valid need?
>
> Thanks,
> --Matt
>
> On 1/12/17, 8:23 AM, "Michael Miklavcic" 
> wrote:
>
> Hi Otto,
>
> You make a great point.
>
> AFA RPM/MPack, we do have some work in the pipeline for streamlining
> things
> a bit with the RPM's and MPack code such that they will be used for
> performing the Metron install in the sandbox VM's rather than Ansible.
> (I'd
> search for the public Jiras and post them here, but Jira is down for
> maintenance currently.) This should help make it obvious that a change
> or
> new feature requires modifications because they will be in the critical
> path to testing.
>
> Documentation is still tricky because we have README files, javadoc,
> and
> the wiki. But in general I think the current approach is to put
> concrete
> functionality docs in the READMEs as much as possible because they can
> be
> tracked and versioned with Git. I think the community has actually been
> doing a pretty good job here. The wiki is a little more tricky because
> there is typically only one version, which tracks master, not
> necessarily
> the latest stable release.
>
> Mike
>
>
> On Thu, Jan 12, 2017 at 8:42 AM, Otto Fowler 
> wrote:
>
> > As Metron evolves to include new deployment options, features, and
> > configurations it is hard and only getting harder for contributors,
> > committers, and reviewers to understand what the required changes are
> > across the different areas of the system to correctly and completely
> > introduce a change or new feature in the system.
> >
> > We have talked some about the requirements or expectations for
> submitters
> > with regards

Re: [DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-12 Thread Michael Miklavcic
Hi Casey,

Thanks for starting this thread. I believe you are correct in your
assessment of the 4 options for updating configs in Metron. When using more
than one of these options we can get into a split-brain scenario. A basic
example is updating the global config on disk and using the
zk_load_configs.sh. Later, if a user decides to restart Ambari, the cached
version stored by Ambari (it's in the MySQL or other database backing
Ambari) will be written out to disk in the defined config directory, and
subsequently loaded using the zk_load_configs.sh under the hood. Any global
configuration modified outside of Ambari will be lost at this point. This
is obviously undesirable, but I also like the purpose and utility exposed
by the multiple config management interfaces we currently have available. I
also agree that a service would be best.

For reference, here's my understanding of the current configuration loading
mechanisms and their deps.

[image: Inline image 1]

Mike
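
To make the clobber scenario above concrete, a minimal Python sketch of the
"pull before you push" discipline that an intermediating service (or a careful
user) would need, using the kazoo ZooKeeper client. The /metron/topology/global
path is the conventional location and is assumed here, not authoritative:

import json
from kazoo.client import KazooClient

GLOBAL_CONFIG_PATH = "/metron/topology/global"

def update_global_config(zk_quorum, new_values):
    zk = KazooClient(hosts=zk_quorum)
    zk.start()
    try:
        # Pull the current config first, so edits made through Ambari, the
        # Stellar REPL, or the management UI are not clobbered by a stale
        # local copy.
        data, _stat = zk.get(GLOBAL_CONFIG_PATH)
        config = json.loads(data.decode("utf-8")) if data else {}
        config.update(new_values)
        zk.set(GLOBAL_CONFIG_PATH, json.dumps(config, indent=2).encode("utf-8"))
    finally:
        zk.stop()

# e.g. update_global_config("node1:2181", {"es.date.format": "yyyy.MM.dd.HH"})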


On Thu, Jan 12, 2017 at 3:08 PM, Casey Stella  wrote:

> In the course of discussion on the PR for METRON-652
>  something that I
> should definitely have understood better came to light and I thought that
> it was worth bringing to the attention of the community to get
> clarification/discuss is just how we manage configs.
>
> Currently (assuming the management UI that Ryan Merriman submitted) configs
> are managed/adjusted via a couple of different mechanisms.
>
>- zk_load_configs.sh: pushed and pulled from disk to zookeeper
>- Stellar REPL: pushed and pulled via the CONFIG_GET/CONFIG_PUT
> functions
>- Ambari: initialized via the zk_load_configs script and then some of them
>are managed directly (global config) and some indirectly
> (sensor-specific
>configs).
>   - NOTE: Upon service restart, it may or may not overwrite changes on
>   disk or on zookeeper.  *Can someone more knowledgeable than me about
>   this describe precisely the semantics that we can expect on
> service restart
>   for Ambari? What gets overwritten on disk and what gets updated
> in ambari?*
>- The Management UI: manages some of the configs. *RYAN: Which configs
>do we support here and which don't we support here?*
>
> As you can see, we have a mishmash of mechanisms to update and manage the
> configuration for Metron in zookeeper.  In the beginning the approach was
> just to edit configs on disk and push/pull them via zk_load_configs.  Configs
> could be historically managed using source control, etc.  As we got more
> and more components managing the configs, we haven't taken care that
> they all work with each other in an expected way (I believe these are
> true..correct me if I'm wrong):
>
>- If configs are modified in the management UI or the Stellar REPL and
>someone forgets to pull the configs from zookeeper to disk, before they
> do
> a push via zk_load_configs, they will clobber the configs in zookeeper
> with
>old configs.
>- If the global config is changed on disk and the ambari service
>restarts, it'll get reset with the original global config.
>- *Ryan, in the management UI, if someone changes the zookeeper configs
>from outside, are those configs reflected immediately in the UI?*
>
>
> It seems to me that we have a couple of options here:
>
>- A service to intermediate and handle config update/retrieval and
>tracking historical changes so these different mechanisms can use a
> common
>component for config management/tracking and refactor the existing
>mechanisms to use that service
>- Standardize on exactly one component to manage the configs and regress
>the others (that's a verb, right?   nicer than delete.)
>
> I happen to like the service approach, myself, but I wanted to put it up
> for discussion and hopefully someone will volunteer to design such a thing.
>
> To frame the debate, I want us to keep in mind a couple of things that may
> or may not be relevant to the discussion:
>
>- We will eventually be moving to support kerberos so there should at
>least be a path to use kerberos for any solution IMO
>- There is value in each of the different mechanisms in place now.  If
>there weren't, then they wouldn't have been created.  Before we try to
> make
>this a "there can be only one" argument, I'd like to hear very good
>arguments.
>
> Finally, I'd appreciate if some people might answer the questions I have in
> bold there.  Hopefully this discussion, if nothing else happens, will
> result in fodder for proper documentation of the ins and outs of each of
> the components bulleted above.
>
> Best,
>
> Casey
>


Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley
Ah, I see.  If overriding the default index name allows using the same name for 
multiple sensors, then the goal can be achieved.
Thanks,
--Matt
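
For illustration, the index-name override Casey describes in the quoted reply
below, in sketch form: three hypothetical parsers fed from the same syslog
stream, each with its own sensor config, all pointing their writers at one
"syslog" index so the data can be queried together. Field names are
assumptions, not the committed config format:

import json

for sensor in ("cisco_asa", "palo_alto", "juniper_srx"):   # hypothetical sensor names
    indexing_config = {
        "elasticsearch": {"index": "syslog", "batchSize": 5, "enabled": True},
        "hdfs": {"index": "syslog", "batchSize": 25, "enabled": True},
    }
    print(sensor, "->", json.dumps(indexing_config))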


On 1/12/17, 3:30 PM, "Casey Stella"  wrote:

Oh, you could!  Let's say you have a syslog parser with data from sources 1,
2 and 3.  You'd end up with one kafka queue with 3 parsers attached to that
queue, each picking out the messages from source 1, 2 and 3.  They'd go
through separate enrichment and into the indexing topology.  In the
indexing topology, you could specify the same index name "syslog" and all
of the messages go into the same index for CEP querying if so desired.

On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley  wrote:

> Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> life.  It makes perfect sense to route different lines from syslog through
> different appropriate parsers.  But a lot of what the parsers do is
> identify consistent subsets of metadata and annotate it – eg, src_ip_addr,
> event timestamps, etc.  Once those metadata are annotated and available
> with common field names, why doesn’t it make sense to index the messages
> together, for CEP querying?  I think Splunk has illustrated this model.
>
> On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
>
> yeah, I mean, honestly, I think the approach that we've taken for
> sources
> which aggregate different types of data is to provide filters at the
> parser
> level and have multiple parser topologies (with different, possibly
> mutually exclusive filters) running.  This would be a completely
> separate
> sensor.  Imagine a syslog data source that aggregates and you want to
> pick
> apart certain pieces of messages.  This is why the initial thought and
> architecture was one index per sensor.
>
> On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:
>
> > I’m thinking that CEP (Complex Event Processing) is contrary to the
> idea
> > of silo-ing data per sensor.
> > Now it’s true that some of those sensors are already aggregating
> data from
> > multiple sources, so maybe I’m wrong here.
> > But it just seems to me that the “data lake” insights come from
> being able
> > to make decisions over the whole mass of data rather than just
> vertical
> > slices of it.
> >
> > On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
> >
> > Hey Matt,
> >
> > Thanks for the comment!
> > 1. At the moment, we only have one index name, the default of
> which is
> > the
> > sensor name but that's entirely up to the user.  This is sensor
> > specific,
> > so it'd be a separate config for each sensor.  If we want to
> build
> > multiple
> > indices per sensor, we'd have to think carefully about how to do
> that
> > and
> > would be a bigger undertaking.  I guess I can see the use, 
though
> > (redirect
> > messages to one index vs another based on a predicate for a 
given
> > sensor).
> > Anyway, not where I was originally thinking that this discussion
> would
> > go,
> > but it's an interesting point.
> >
> > 2. I hadn't thought through the implementation quite yet, but we
> don't
> > actually have a splitter bolt in that topology, just a spout
> that goes
> > to
> > the elasticsearch writer and also to the hdfs writer.
> >
> > On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley 
> wrote:
> >
> > > Casey, good to have controls like this.  Couple questions:
> > >
> > > 1. Regarding the “index” : “squid” name/value pair, is the
> index name
> > > expected to always be a sensor name?  Or is the given json
> structure
> > > subordinate to a sensor name in zookeeper?  Or can we build
> arbitrary
> > > indexes with this new specification, independent of sensor?
> Should
> > there
> > > actually be a list of “indexes”, ie
> > > { “indexes” : [
> > > {“index” : “name1”,
> > > …
> > > },
> > > {“index” : “name2”,
> > > …
> > > } ]
> > > }
> > >
> > > 2. Would the filtering / writer selection logic take place in
> the
> > indexing
> > > topology splitter bolt?  Seems like that would have the
> smallest
> > impact on
> > > current implementation, no?
> > >
> > > Sorry if these are already answered in PR-415, I haven’t had
   

Re: [PROPOSAL] up-to-date versioned documentation

2017-01-12 Thread Matt Foley
I’m ambivalent; I think we’d end up tied to the doxia processing pipeline, 
which is “yet another arcane toolset” to learn.  Using .md as the input format 
decreases the dependency, but we’d still be dependent on it.

I had anticipated that the web page would be a write-once thing that would be 
only a couple days for an experienced Web developer. But I was going to get an 
estimate from some co-workers before actually trying to get it implemented. And 
the script is a few hours of work with find and awk.

On the other hand, doxia is certainly an expectable solution.  Is setting up 
that infrastructure less work than developing the web page?  Or is it actually 
just a matter of a few lines in pom.xml?


On 1/12/17, 3:24 PM, "Casey Stella"  wrote:

Just a followup thought that's a bit more constructive, maybe we could
migrate the README.md's into a site directory and use doxia markdown
(example here ) to
generate the site as part of the build to resolve 1 through 3?

On Thu, Jan 12, 2017 at 6:19 PM, Casey Stella  wrote:

> So, I do think this would be better than what we currently do.  I like a
> few things in particular:
>
>- I don't like the wiki one bit.
>- We have a LOT of documentation in the README.md's and it's sometimes
>poorly organized
>- I like a documentation preprocessing pipeline to be present.  For
>instance, a major ask is all of the stellar functions in one place.  
That's
>solved by updating an index manually in the READMEs and keeping it in 
sync
>with the annotation.  I'd like to make a stellar annotation -> markdown
>generator as part of the build and this would be nice for such a task.
>
> My only concern is that the html generation/viewer seems like a fair
> amount of engineering.  Are you sure there isn't something easier that we
> could conform to?  I'm sure we aren't the only project in the world that
> has this particular issue.  Is there something like a maven site plugin or
> something?  Just a thought.  I'll come back with more :)
>
> Great ideas!  Keep them coming!
>
> Casey
>
> On Thu, Jan 12, 2017 at 6:05 PM, Matt Foley  wrote:
>
>> We currently have three forms of documentation, with the following
>> advantages and disadvantages:
>>
>> || Docs || Pro || Con ||
>> | CWiki |
>>   Easy to edit, no special tools required, don't have to be a
>> developer to contribute, google and wiki search |
>> Not versioned, no review process, distant from the code, obsolete content
>> tends to accumulate |
>> | Site |
>>   Versioned and reviewed, only committers can edit, google search |
>>   Yet another arcane toolset must be learned, only web programmers
>> feel comfortable contributing, "asf-site" branch not related to code
>> versions, distant from the code, tends to go obsolete due to
>> non-maintenance |
>> | README.md |
>>   Versioned and reviewed, only committers can edit, tied to code
>> versions, highly local to the code being documented |
>>   Non-developers don't know about them, may be scared by github, poor
>> scoring in google search, no high-level presentation |
>>
>> Various discussion threads indicate the developer community likes
>> README-based docs, and it's easy to see why from the above.  I propose 
this
>> extension to the README-based documentation, to address their 
disadvantages:
>>
>> 1. Produce a script that gathers the README.md files from all code
>> subdirectories into a hierarchical list.  The script would have an
>> exclusion list for non-user-content, which at this point would consist of
>> [site/*, build_utils/*].  The hierarchy would be sorted depth-first.  The
>> resulting hierarchical list at this time (with six added README files to
>> complete the hierarchy) would be:
>>
>> ./README.md
>> ./metron-analytics/README.md  <== (need file here)
>> ./metron-analytics/metron-maas-service/README.md
>> ./metron-analytics/metron-profiler/README.md
>> ./metron-analytics/metron-profiler-client/README.md
>> ./metron-analytics/metron-statistics/README.md
>> ./metron-deployment/README.md
>> ./metron-deployment/amazon-ec2/README.md
>> ./metron-deployment/packaging/README.md  <== (need file here)
>> ./metron-deployment/packaging/ambari/README.md <== (need file here)
>> ./metron-deployment/packaging/docker/ansible-docker/README.md
>> ./metron-deployment/packaging/docker/rpm-docker/README.md
>> ./metron-deployment/packer-build/README.md
>> ./metron-deployment/roles/  <== (need file here)
>> ./metron-deployment/roles/kibana/README.md
>> ./metron-deployment/roles/monit/README.md
>> ./metron-deployment/roles/opentaxii/README.md
>> ./metron-deployment/roles/

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella
Oh, you could!  Let's say you have a syslog parser with data from sources 1,
2 and 3.  You'd end up with one kafka queue with 3 parsers attached to that
queue, each picking out the messages from source 1, 2 and 3.  They'd go
through separate enrichment and into the indexing topology.  In the
indexing topology, you could specify the same index name "syslog" and all
of the messages go into the same index for CEP querying if so desired.

On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley  wrote:

> Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> life.  It makes perfect sense to route different lines from syslog through
> different appropriate parsers.  But a lot of what the parsers do is
> identify consistent subsets of metadata and annotate it – eg, src_ip_addr,
> event timestamps, etc.  Once those metadata are annotated and available
> with common field names, why doesn’t it make sense to index the messages
> together, for CEP querying?  I think Splunk has illustrated this model.
>
> On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
>
> yeah, I mean, honestly, I think the approach that we've taken for
> sources
> which aggregate different types of data is to provide filters at the
> parser
> level and have multiple parser topologies (with different, possibly
> mutually exclusive filters) running.  This would be a completely
> separate
> sensor.  Imagine a syslog data source that aggregates and you want to
> pick
> apart certain pieces of messages.  This is why the initial thought and
> architecture was one index per sensor.
>
> On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:
>
> > I’m thinking that CEP (Complex Event Processing) is contrary to the
> idea
> > of silo-ing data per sensor.
> > Now it’s true that some of those sensors are already aggregating
> data from
> > multiple sources, so maybe I’m wrong here.
> > But it just seems to me that the “data lake” insights come from
> being able
> > to make decisions over the whole mass of data rather than just
> vertical
> > slices of it.
> >
> > On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
> >
> > Hey Matt,
> >
> > Thanks for the comment!
> > 1. At the moment, we only have one index name, the default of
> which is
> > the
> > sensor name but that's entirely up to the user.  This is sensor
> > specific,
> > so it'd be a separate config for each sensor.  If we want to
> build
> > multiple
> > indices per sensor, we'd have to think carefully about how to do
> that
> > and
> > would be a bigger undertaking.  I guess I can see the use, though
> > (redirect
> > messages to one index vs another based on a predicate for a given
> > sensor).
> > Anyway, not where I was originally thinking that this discussion
> would
> > go,
> > but it's an interesting point.
> >
> > 2. I hadn't thought through the implementation quite yet, but we
> don't
> > actually have a splitter bolt in that topology, just a spout
> that goes
> > to
> > the elasticsearch writer and also to the hdfs writer.
> >
> > On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley 
> wrote:
> >
> > > Casey, good to have controls like this.  Couple questions:
> > >
> > > 1. Regarding the “index” : “squid” name/value pair, is the
> index name
> > > expected to always be a sensor name?  Or is the given json
> structure
> > > subordinate to a sensor name in zookeeper?  Or can we build
> arbitrary
> > > indexes with this new specification, independent of sensor?
> Should
> > there
> > > actually be a list of “indexes”, ie
> > > { “indexes” : [
> > > {“index” : “name1”,
> > > …
> > > },
> > > {“index” : “name2”,
> > > …
> > > } ]
> > > }
> > >
> > > 2. Would the filtering / writer selection logic take place in
> the
> > indexing
> > > topology splitter bolt?  Seems like that would have the
> smallest
> > impact on
> > > current implementation, no?
> > >
> > > Sorry if these are already answered in PR-415, I haven’t had
> time to
> > > review that one yet.
> > > Thanks,
> > > --Matt
> > >
> > >
> > > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> > michael.miklav...@gmail.com>
> > > wrote:
> > >
> > > I like the flexibility and expressibility of the first
> option
> > with
> > > Stellar
> > > filters.
> > >
> > > M
> > >
> > > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
> > ceste...@gmail.com>
> > > wrote:
> > >
> > > > As of METRON-652 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley
Syslog is hell on parsers – I know, I worked at LogLogic in a previous life.  
It makes perfect sense to route different lines from syslog through different 
appropriate parsers.  But a lot of what the parsers do is identify consistent 
subsets of metadata and annotate it – eg, src_ip_addr, event timestamps, etc.  
Once those metadata are annotated and available with common field names, why 
doesn’t it make sense to index the messages together, for CEP querying?  I 
think Splunk has illustrated this model. 

On 1/12/17, 3:00 PM, "Casey Stella"  wrote:

yeah, I mean, honestly, I think the approach that we've taken for sources
which aggregate different types of data is to provide filters at the parser
level and have multiple parser topologies (with different, possibly
mutually exclusive filters) running.  This would be a completely separate
sensor.  Imagine a syslog data source that aggregates and you want to pick
apart certain pieces of messages.  This is why the initial thought and
architecture was one index per sensor.

On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:

> I’m thinking that CEP (Complex Event Processing) is contrary to the idea
> of silo-ing data per sensor.
> Now it’s true that some of those sensors are already aggregating data from
> multiple sources, so maybe I’m wrong here.
> But it just seems to me that the “data lake” insights come from being able
> to make decisions over the whole mass of data rather than just vertical
> slices of it.
>
> On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
>
> Hey Matt,
>
> Thanks for the comment!
> 1. At the moment, we only have one index name, the default of which is
> the
> sensor name but that's entirely up to the user.  This is sensor
> specific,
> so it'd be a separate config for each sensor.  If we want to build
> multiple
> indices per sensor, we'd have to think carefully about how to do that
> and
> would be a bigger undertaking.  I guess I can see the use, though
> (redirect
> messages to one index vs another based on a predicate for a given
> sensor).
> Anyway, not where I was originally thinking that this discussion would
> go,
> but it's an interesting point.
>
> 2. I hadn't thought through the implementation quite yet, but we don't
> actually have a splitter bolt in that topology, just a spout that goes
> to
> the elasticsearch writer and also to the hdfs writer.
>
> On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:
>
> > Casey, good to have controls like this.  Couple questions:
> >
> > 1. Regarding the “index” : “squid” name/value pair, is the index 
name
> > expected to always be a sensor name?  Or is the given json structure
> > subordinate to a sensor name in zookeeper?  Or can we build 
arbitrary
> > indexes with this new specification, independent of sensor?  Should
> there
> > actually be a list of “indexes”, ie
> > { “indexes” : [
> > {“index” : “name1”,
> > …
> > },
> > {“index” : “name2”,
> > …
> > } ]
> > }
> >
> > 2. Would the filtering / writer selection logic take place in the
> indexing
> > topology splitter bolt?  Seems like that would have the smallest
> impact on
> > current implementation, no?
> >
> > Sorry if these are already answered in PR-415, I haven’t had time to
> > review that one yet.
> > Thanks,
> > --Matt
> >
> >
> > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> michael.miklav...@gmail.com>
> > wrote:
> >
> > I like the flexibility and expressibility of the first option
> with
> > Stellar
> > filters.
> >
> > M
> >
> > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> >
> > > As of METRON-652  > incubator-metron/pull/415>, we
> > > will have decoupled the indexing configuration from the
> enrichment
> > > configuration.  As an immediate follow-up to that, I'd like to
> > provide the
> > > ability to turn off and on writers via the configs.  I'd like
> to get
> > some
> > > community feedback on how the functionality should work, if
> y'all are
> > > amenable. :)
> > >
> > >
> > > As of now, we have 3 possible writers which can be used in the
> > indexing
> > > topology:
> > >
> > >- Solr
> > >- Elasticsearch

Re: [PROPOSAL] up-to-date versioned documentation

2017-01-12 Thread Casey Stella
Just a followup thought that's a bit more constructive: maybe we could
migrate the README.md's into a site directory and use doxia markdown
(example here ) to
generate the site as part of the build to resolve 1 through 3?
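
As a sketch of step 1 of Matt's proposal (quoted below), and of the script he
estimates elsewhere in the thread as a few hours of find and awk, here is the
same idea in Python: gather every README.md, skip the exclusion list, and emit
the hierarchy with parents listed before their children:

import fnmatch
import os

EXCLUDE = ["site/*", "build_utils/*"]

def excluded(rel_dir):
    return any(fnmatch.fnmatch(rel_dir, pat) or rel_dir == pat.rstrip("/*")
               for pat in EXCLUDE)

def gather_readmes(repo_root="."):
    hits = []
    for dirpath, _, files in os.walk(repo_root):
        rel = os.path.relpath(dirpath, repo_root)
        if rel != "." and excluded(rel):
            continue
        if "README.md" in files:
            hits.append("./README.md" if rel == "." else os.path.join(".", rel, "README.md"))
    # sort by directory components so each parent README precedes its children
    return sorted(hits, key=lambda p: os.path.dirname(p).split(os.sep))

if __name__ == "__main__":
    print("\n".join(gather_readmes()))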

On Thu, Jan 12, 2017 at 6:19 PM, Casey Stella  wrote:

> So, I do think this would be better than what we currently do.  I like a
> few things in particular:
>
>- I don't like the wiki one bit.
>- We have a LOT of documentation in the README.md's and it's sometimes
>poorly organized
>- I like a documentation preprocessing pipeline to be present.  For
>instance, a major ask is all of the stellar functions in one place.  That's
>solved by updating an index manually in the READMEs and keeping it in sync
>with the annotation.  I'd like to make a stellar annotation -> markdown
>generator as part of the build and this would be nice for such a task.
>
> My only concern is that the html generation/viewer seems like a fair
> amount of engineering.  Are you sure there isn't something easier that we
> could conform to?  I'm sure we aren't the only project in the world that
> has this particular issue.  Is there something like a maven site plugin or
> something?  Just a thought.  I'll come back with more :)
>
> Great ideas!  Keep them coming!
>
> Casey
>
> On Thu, Jan 12, 2017 at 6:05 PM, Matt Foley  wrote:
>
>> We currently have three forms of documentation, with the following
>> advantages and disadvantages:
>>
>> || Docs || Pro || Con ||
>> | CWiki |
>>   Easy to edit, no special tools required, don't have to be a
>> developer to contribute, google and wiki search |
>> Not versioned, no review process, distant from the code, obsolete content
>> tends to accumulate |
>> | Site |
>>   Versioned and reviewed, only committers can edit, google search |
>>   Yet another arcane toolset must be learned, only web programmers
>> feel comfortable contributing, "asf-site" branch not related to code
>> versions, distant from the code, tends to go obsolete due to
>> non-maintenance |
>> | README.md |
>>   Versioned and reviewed, only committers can edit, tied to code
>> versions, highly local to the code being documented |
>>   Non-developers don't know about them, may be scared by github, poor
>> scoring in google search, no high-level presentation |
>>
>> Various discussion threads indicate the developer community likes
>> README-based docs, and it's easy to see why from the above.  I propose this
>> extension to the README-based documentation, to address their disadvantages:
>>
>> 1. Produce a script that gathers the README.md files from all code
>> subdirectories into a hierarchical list.  The script would have an
>> exclusion list for non-user-content, which at this point would consist of
>> [site/*, build_utils/*].  The hierarchy would be sorted depth-first.  The
>> resulting hierarchical list at this time (with six added README files to
>> complete the hierarchy) would be:
>>
>> ./README.md
>> ./metron-analytics/README.md  <== (need file here)
>> ./metron-analytics/metron-maas-service/README.md
>> ./metron-analytics/metron-profiler/README.md
>> ./metron-analytics/metron-profiler-client/README.md
>> ./metron-analytics/metron-statistics/README.md
>> ./metron-deployment/README.md
>> ./metron-deployment/amazon-ec2/README.md
>> ./metron-deployment/packaging/README.md  <== (need file here)
>> ./metron-deployment/packaging/ambari/README.md <== (need file here)
>> ./metron-deployment/packaging/docker/ansible-docker/README.md
>> ./metron-deployment/packaging/docker/rpm-docker/README.md
>> ./metron-deployment/packer-build/README.md
>> ./metron-deployment/roles/  <== (need file here)
>> ./metron-deployment/roles/kibana/README.md
>> ./metron-deployment/roles/monit/README.md
>> ./metron-deployment/roles/opentaxii/README.md
>> ./metron-deployment/roles/pcap_replay/README.md
>> ./metron-deployment/roles/sensor-test-mode/README.md
>> ./metron-deployment/vagrant/README.md  <== (need file here)
>> ./metron-deployment/vagrant/codelab-platform/README.md
>> ./metron-deployment/vagrant/fastcapa-test-platform/README.md
>> ./metron-deployment/vagrant/full-dev-platform/README.md
>> ./metron-deployment/vagrant/quick-dev-platform/README.md
>> ./metron-platform/README.md
>> ./metron-platform/metron-api/README.md
>> ./metron-platform/metron-common/README.md
>> ./metron-platform/metron-data-management/README.md
>> ./metron-platform/metron-enrichment/README.md
>> ./metron-platform/metron-indexing/README.md
>> ./metron-platform/metron-management/README.md
>> ./metron-platform/metron-parsers/README.md
>> ./metron-platform/metron-pcap-backend/README.md
>> ./metron-sensors/README.md  <== (need file here)
>> ./metron-sensors/fastcapa/README.md
>> ./metron-sensors/pycapa/README.md
>>
>> 2. Arrange to run this script as part of the build process, and commit
>> the resulting hierarchy list to someplace in the versioned and branched
>> ./site/

[GitHub] incubator-metron issue #397: METRON-627: Add HyperLogLogPlus implementation ...

2017-01-12 Thread mmiklavc
Github user mmiklavc commented on the issue:

https://github.com/apache/incubator-metron/pull/397
  
@nickwallen @cestella Moved the HLLP implementation wrapper and Stellar 
functions to the metron-statistics project. The Bloom filter Stellar functions 
should probably be moved to the stats project as well, but that's for a 
separate PR, I feel. 

I also added a utility for generating performance metrics, along with a README 
(HLLP.md) that is linked from the main metron-statistics README as well as the 
metron-common README. It provides a link to the Google paper, an exposition of 
performance results, and a new default precision for the sparse and dense 
(normal) sets.

Demo sample soon to follow using the profiler.




Re: [PROPOSAL] up-to-date versioned documentation

2017-01-12 Thread Casey Stella
So, I do think this would be better than what we currently do.  I like a
few things in particular:

   - I don't like the wiki one bit.
   - We have a LOT of documentation in the README.md's and it's sometimes
   poorly organized
   - I'd like a documentation preprocessing pipeline to be present.  For
   instance, a major ask is having all of the Stellar functions in one place.
   Today that's solved by manually updating an index in the READMEs and keeping
   it in sync with the annotations.  I'd like to make a stellar annotation ->
   markdown generator as part of the build, and this pipeline would be a nice
   fit for such a task (rough sketch below).
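
A very rough sketch of what that generator could look like, assuming Stellar
functions are tagged with a @Stellar annotation carrying name and description
fields; the regex and output layout below are illustrative only, not the
proposed tool:

```python
# Rough sketch only: scan Java sources for @Stellar annotations and emit a
# markdown index.  The annotation fields and the regex are assumptions.
import re
from pathlib import Path

ANNOTATION = re.compile(
    r'@Stellar\s*\(\s*[^)]*?name\s*=\s*"(?P<name>[^"]+)"'
    r'[^)]*?description\s*=\s*"(?P<desc>[^"]+)"',
    re.DOTALL,
)

def generate_index(source_root=".", output_file="stellar-functions.md"):
    rows = set()
    for java_file in Path(source_root).rglob("*.java"):
        text = java_file.read_text(errors="ignore")
        for match in ANNOTATION.finditer(text):
            rows.add((match.group("name"), match.group("desc")))
    with open(output_file, "w") as out:
        out.write("| Function | Description |\n|----------|-------------|\n")
        for name, desc in sorted(rows):
            out.write("| {} | {} |\n".format(name, desc))

if __name__ == "__main__":
    generate_index()
```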

My only concern is that the html generation/viewer seems like a fair amount
of engineering.  Are you sure there isn't something easier that we could
conform to?  I'm sure we aren't the only project in the world that has this
particular issue.  Is there something like a maven site plugin or
something?  Just a thought.  I'll come back with more :)

Great ideas!  Keep them coming!

Casey

On Thu, Jan 12, 2017 at 6:05 PM, Matt Foley  wrote:

> We currently have three forms of documentation, with the following
> advantages and disadvantages:
>
> || Docs || Pro || Con ||
> | CWiki |
>   Easy to edit, no special tools required, don't have to be a
> developer to contribute, google and wiki search |
> Not versioned, no review process, distant from the code, obsolete content
> tends to accumulate |
> | Site |
>   Versioned and reviewed, only committers can edit, google search |
>   Yet another arcane toolset must be learned, only web programmers
> feel comfortable contributing, "asf-site" branch not related to code
> versions, distant from the code, tends to go obsolete due to
> non-maintenance |
> | README.md |
>   Versioned and reviewed, only committers can edit, tied to code
> versions, highly local to the code being documented |
>   Non-developers don't know about them, may be scared by github, poor
> scoring in google search, no high-level presentation |
>
> Various discussion threads indicate the developer community likes
> README-based docs, and it's easy to see why from the above.  I propose this
> extension to the README-based documentation, to address their disadvantages:
>
> 1. Produce a script that gathers the README.md files from all code
> subdirectories into a hierarchical list.  The script would have an
> exclusion list for non-user-content, which at this point would consist of
> [site/*, build_utils/*].  The hierarchy would be sorted depth-first.  The
> resulting hierarchical list at this time (with six added README files to
> complete the hierarchy) would be:
>
> ./README.md
> ./metron-analytics/README.md  <== (need file here)
> ./metron-analytics/metron-maas-service/README.md
> ./metron-analytics/metron-profiler/README.md
> ./metron-analytics/metron-profiler-client/README.md
> ./metron-analytics/metron-statistics/README.md
> ./metron-deployment/README.md
> ./metron-deployment/amazon-ec2/README.md
> ./metron-deployment/packaging/README.md  <== (need file here)
> ./metron-deployment/packaging/ambari/README.md <== (need file here)
> ./metron-deployment/packaging/docker/ansible-docker/README.md
> ./metron-deployment/packaging/docker/rpm-docker/README.md
> ./metron-deployment/packer-build/README.md
> ./metron-deployment/roles/  <== (need file here)
> ./metron-deployment/roles/kibana/README.md
> ./metron-deployment/roles/monit/README.md
> ./metron-deployment/roles/opentaxii/README.md
> ./metron-deployment/roles/pcap_replay/README.md
> ./metron-deployment/roles/sensor-test-mode/README.md
> ./metron-deployment/vagrant/README.md  <== (need file here)
> ./metron-deployment/vagrant/codelab-platform/README.md
> ./metron-deployment/vagrant/fastcapa-test-platform/README.md
> ./metron-deployment/vagrant/full-dev-platform/README.md
> ./metron-deployment/vagrant/quick-dev-platform/README.md
> ./metron-platform/README.md
> ./metron-platform/metron-api/README.md
> ./metron-platform/metron-common/README.md
> ./metron-platform/metron-data-management/README.md
> ./metron-platform/metron-enrichment/README.md
> ./metron-platform/metron-indexing/README.md
> ./metron-platform/metron-management/README.md
> ./metron-platform/metron-parsers/README.md
> ./metron-platform/metron-pcap-backend/README.md
> ./metron-sensors/README.md  <== (need file here)
> ./metron-sensors/fastcapa/README.md
> ./metron-sensors/pycapa/README.md
>
> 2. Arrange to run this script as part of the build process, and commit the
> resulting hierarchy list to someplace in the versioned and branched ./site/
> subdirectory.
>
> 3. Produce a "doc reader" web page that takes in this hierarchy of .md
> pages, and presents a LHS doc tree of links, and a main display area for a
> currently selected file.  If we want to get fancy, this page would also
> provide: (a) telescoping (collapse/expand) of the doc tree; (b) floating
> next/prev/up/home buttons in the display area.
>
> #4. Add to this web page a pull-down menu that selects among all the
> r

[PROPOSAL] up-to-date versioned documentation

2017-01-12 Thread Matt Foley
We currently have three forms of documentation, with the following advantages 
and disadvantages:

|| Docs || Pro || Con ||
| CWiki | 
  Easy to edit, no special tools required, don't have to be a developer to 
contribute, google and wiki search | 
Not versioned, no review process, distant from the code, obsolete content tends 
to accumulate |
| Site | 
  Versioned and reviewed, only committers can edit, google search | 
  Yet another arcane toolset must be learned, only web programmers feel 
comfortable contributing, "asf-site" branch not related to code versions, 
distant from the code, tends to go obsolete due to non-maintenance |
| README.md | 
  Versioned and reviewed, only committers can edit, tied to code versions, 
highly local to the code being documented | 
  Non-developers don't know about them, may be scared by github, poor 
scoring in google search, no high-level presentation |

Various discussion threads indicate the developer community likes README-based 
docs, and it's easy to see why from the above.  I propose this extension to the 
README-based documentation, to address their disadvantages:

1. Produce a script that gathers the README.md files from all code 
subdirectories into a hierarchical list.  The script would have an exclusion 
list for non-user-content, which at this point would consist of [site/*, 
build_utils/*].  The hierarchy would be sorted depth-first.  The resulting 
hierarchical list at this time (with six added README files to complete the 
hierarchy) would be:

./README.md
./metron-analytics/README.md  <== (need file here)
./metron-analytics/metron-maas-service/README.md
./metron-analytics/metron-profiler/README.md
./metron-analytics/metron-profiler-client/README.md
./metron-analytics/metron-statistics/README.md
./metron-deployment/README.md
./metron-deployment/amazon-ec2/README.md
./metron-deployment/packaging/README.md  <== (need file here)
./metron-deployment/packaging/ambari/README.md <== (need file here)
./metron-deployment/packaging/docker/ansible-docker/README.md
./metron-deployment/packaging/docker/rpm-docker/README.md
./metron-deployment/packer-build/README.md
./metron-deployment/roles/  <== (need file here)
./metron-deployment/roles/kibana/README.md
./metron-deployment/roles/monit/README.md
./metron-deployment/roles/opentaxii/README.md
./metron-deployment/roles/pcap_replay/README.md
./metron-deployment/roles/sensor-test-mode/README.md
./metron-deployment/vagrant/README.md  <== (need file here)
./metron-deployment/vagrant/codelab-platform/README.md
./metron-deployment/vagrant/fastcapa-test-platform/README.md
./metron-deployment/vagrant/full-dev-platform/README.md
./metron-deployment/vagrant/quick-dev-platform/README.md
./metron-platform/README.md
./metron-platform/metron-api/README.md
./metron-platform/metron-common/README.md
./metron-platform/metron-data-management/README.md
./metron-platform/metron-enrichment/README.md
./metron-platform/metron-indexing/README.md
./metron-platform/metron-management/README.md
./metron-platform/metron-parsers/README.md
./metron-platform/metron-pcap-backend/README.md
./metron-sensors/README.md  <== (need file here)
./metron-sensors/fastcapa/README.md
./metron-sensors/pycapa/README.md

2. Arrange to run this script as part of the build process, and commit the 
resulting hierarchy list to someplace in the versioned and branched ./site/ 
subdirectory.  (A rough sketch of such a script follows the numbered steps.)

3. Produce a "doc reader" web page that takes in this hierarchy of .md pages, 
and presents a LHS doc tree of links, and a main display area for a currently 
selected file.  If we want to get fancy, this page would also provide: (a) 
telescoping (collapse/expand) of the doc tree; (b) floating next/prev/up/home 
buttons in the display area.

4. Add to this web page a pull-down menu that selects among all the release 
versions of Metron, and (if not running in the Apache site) a SNAPSHOT version 
for the current filesystem version (for developer preview).  Let it re-write 
the file paths per release version to the proper release tag in github.  This 
web page will therefore be version-independent.  Put it in the asf-site branch 
of the Apache site, as the new "docs" sub-site from the home web page.  Update 
the list of releases at each release, or if we want to get fancy, teach it to 
read the release tags from github.

5. As part of the release process, the release manager (a) ensures the release 
is tagged in github with a consistent naming convention, and (b) submits the 
new hierarchy of links to google search (there's an api for that).

6. Deprecate the use of cwiki for anything but long-lived 
demonstrations/tutorials that are unlikely to go obsolete.
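
For concreteness, here is a rough sketch of the gathering script from steps 1
and 2.  The exclusion list is the one given above; everything else (output
location and format under ./site/) is a placeholder for discussion, not a
finished tool:

```python
# Collect README.md paths depth-first, honoring the exclusion list, and write
# the hierarchy under ./site/.  Sketch only; paths and output are placeholders.
import os

EXCLUSIONS = ("./site/", "./build_utils/")

def gather_readmes(repo_root="."):
    readmes = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        dirnames.sort()  # keep traversal deterministic
        if "README.md" in filenames:
            path = os.path.join(dirpath, "README.md").replace(os.sep, "/")
            if not path.startswith("./"):
                path = "./" + path
            if not path.startswith(EXCLUSIONS):
                readmes.append(path)
    # Lexicographic sort keeps each subtree's top-level README ahead of its children.
    return sorted(readmes)

def write_hierarchy(paths, out_file="site/doc-hierarchy.txt"):
    os.makedirs(os.path.dirname(out_file), exist_ok=True)
    with open(out_file, "w") as out:
        out.write("\n".join(paths) + "\n")

if __name__ == "__main__":
    write_hierarchy(gather_readmes())
```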


Do folks feel this would be a good contribution to the visibility, timeliness, 
and usability of our docs?
Is this an adequate solution for the current problems?

Thanks, 
--Matt




Re: [DISCUSS] Dev Guide and Committer Review Guide additions?

2017-01-12 Thread Matt Foley
Casey, great, we crossed messages!  Thanks for starting that thread, I’ll 
participate there.
--Matt

On 1/12/17, 2:51 PM, "Casey Stella"  wrote:

Regarding 3, Matt, I just started a dev list discussion about configs and
the various components that manage them and how they interact.  Hopefully
we end up in a coherent approach, but in the lead of that, I'd say yes,
valid need for such an architecture.  Please chime in on that thread or
even in reply to this thread (I'll take anything I can get ;) with thoughts.

On Thu, Jan 12, 2017 at 5:49 PM, Matt Foley  wrote:

> I think I hear 3 major areas not adequately covered by our usual “code
> review”:
> 1. Documentation
> 2. Deployment Builds
> 3. Management of config parameters
>
> The other areas mentioned by Otto (testing, perf test, Stellar impact, and
> REST api impact), are entirely valid, but fall under existing code and
> architecture that seems generally adequate.
>
> Regarding #1, Documentation, I’d like to branch a discussion thread for a
> proposal I’m about to make, to enhance our use of README files as usable
> and up-to-date end-user documentation, linked from the Metron site.
> Implicit in that is the idea that we’d deprecate using the cwiki for
> anything but long-lived demonstrations/tutorials that are unlikely to go
> obsolete.
>
> For #2, Deployment Builds:  This is difficult, and unfortunately I’m not
> an expert with these things, but we need to automate this as much as
> possible.  Config params will always interact heavily with deployment
> issues, but let’s leave that for #3 :0)
>
> As far as RPMs, Ansible playbooks, or Docker images go, we’d like to
> automate so that developers never have to do anything when they are
> committing modifications of existing components, and even when new
> components are added (like the Profiler is being added now), it should
> insofar as possible be automated via maven declarations.  But that takes
> input from the experts in each of the areas.
>
> Also, what would people think of dropping Ansible in favor of Ambari and
> Docker as the preferred deployment management approaches?
>
> #3, Management of config parameters:  I’ve been thinking about this
> lately, but haven’t written up a proposal yet.  I’m bothered by the wide
> ranging variability in the way Metron configs are managed: files,
> zookeeper, environment variables, traditional Hadoop-style configs, and
> roll-your-own json configs, sometimes shared, sometimes duplicated, not to
> mention Ambari over it all.  This has been encouraged by the huge number 
of
> Stack components that Metron depends on, and the relative independence of
> the components Metron itself is composed of.
>
> But I think as Otto points out, as we grow the number of components and
> mature out of the incubator, we have to get this under control.  We need 
an
> architecture for management of configuration parameters of the Metron
> topologies.  (We can’t do much about the Stack components, but Ambari is
> establishing a culture around managing those.)  The architecture needs to
> include update methodology for semantic changes in parameter sets.
>
> I’m mulling such an architecture, but what do other people think?  Is this
> a valid need?
>
> Thanks,
> --Matt
>
> On 1/12/17, 8:23 AM, "Michael Miklavcic" 
> wrote:
>
> Hi Otto,
>
> You make a great point.
>
> AFA RPM/MPack, we do have some work in the pipeline for streamlining
> things
> a bit with the RPM's and MPack code such that they will be used for
> performing the Metron install in the sandbox VM's rather than Ansible.
> (I'd
> search for the public Jiras and post them here, but Jira is down for
> maintenance currently.) This should help make it obvious that a change
> or
> new feature requires modifications because they will be in the 
critical
> path to testing.
>
> Documentation is still tricky because we have README files, javadoc,
> and
> the wiki. But in general I think the current approach is to put
> concrete
> functionality docs in the READMEs as much as possible because they can
> be
> tracked and versioned with Git. I think the community has actually 
been
> doing a pretty good job here. The wiki is a little more tricky because
> there is typically only one version, which tracks master, not
> necessarily
> the latest stable release.
>
> Mike
>
>
> On Thu, Jan 12, 2017 at 8:42 AM, Otto Fowler 
> wrote:
>
> > As Metron evolves to include new deployment options, features, and
> > configurations it is hard and only getting harder for contr

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella
yeah, I mean, honestly, I think the approach that we've taken for sources
which aggregate different types of data is to provide filters at the parser
level and have multiple parser topologies (with different, possibly
mutually exclusive filters) running.  This would be a completely separate
sensor.  Imagine a syslog data source that aggregates and you want to pick
apart certain pieces of messages.  This is why the initial thought and
architecture was one index per sensor.
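
To illustrate the routing idea in the abstract only (the predicates and parser
names below are made up, and this is not how Metron's parser filters are
actually configured):

```python
# One aggregated feed, several mutually exclusive filters, each routing matching
# lines to its own parser.  Predicates and names are invented for illustration.
FILTERS = [
    ("firewall_parser", lambda line: line.startswith("%ASA")),
    ("auth_parser",     lambda line: "sshd[" in line),
    ("proxy_parser",    lambda line: "squid" in line),
]

def route(line):
    """Return the parsers whose (ideally mutually exclusive) filter accepts this line."""
    return [name for name, accepts in FILTERS if accepts(line)]

print(route("%ASA-6-302013: Built outbound TCP connection ..."))  # ['firewall_parser']
```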

On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:

> I’m thinking that CEP (Complex Event Processing) is contrary to the idea
> of silo-ing data per sensor.
> Now it’s true that some of those sensors are already aggregating data from
> multiple sources, so maybe I’m wrong here.
> But it just seems to me that the “data lake” insights come from being able
> to make decisions over the whole mass of data rather than just vertical
> slices of it.
>
> On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
>
> Hey Matt,
>
> Thanks for the comment!
> 1. At the moment, we only have one index name, the default of which is
> the
> sensor name but that's entirely up to the user.  This is sensor
> specific,
> so it'd be a separate config for each sensor.  If we want to build
> multiple
> indices per sensor, we'd have to think carefully about how to do that
> and
> would be a bigger undertaking.  I guess I can see the use, though
> (redirect
> messages to one index vs another based on a predicate for a given
> sensor).
> Anyway, not where I was originally thinking that this discussion would
> go,
> but it's an interesting point.
>
> 2. I hadn't thought through the implementation quite yet, but we don't
> actually have a splitter bolt in that topology, just a spout that goes
> to
> the elasticsearch writer and also to the hdfs writer.
>
> On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:
>
> > Casey, good to have controls like this.  Couple questions:
> >
> > 1. Regarding the “index” : “squid” name/value pair, is the index name
> > expected to always be a sensor name?  Or is the given json structure
> > subordinate to a sensor name in zookeeper?  Or can we build arbitrary
> > indexes with this new specification, independent of sensor?  Should
> there
> > actually be a list of “indexes”, ie
> > { “indexes” : [
> > {“index” : “name1”,
> > …
> > },
> > {“index” : “name2”,
> > …
> > } ]
> > }
> >
> > 2. Would the filtering / writer selection logic take place in the
> indexing
> > topology splitter bolt?  Seems like that would have the smallest
> impact on
> > current implementation, no?
> >
> > Sorry if these are already answered in PR-415, I haven’t had time to
> > review that one yet.
> > Thanks,
> > --Matt
> >
> >
> > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> michael.miklav...@gmail.com>
> > wrote:
> >
> > I like the flexibility and expressibility of the first option
> with
> > Stellar
> > filters.
> >
> > M
> >
> > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> >
> > > As of METRON-652  > incubator-metron/pull/415>, we
> > > will have decoupled the indexing configuration from the
> enrichment
> > > configuration.  As an immediate follow-up to that, I'd like to
> > provide the
> > > ability to turn off and on writers via the configs.  I'd like
> to get
> > some
> > > community feedback on how the functionality should work, if
> y'all are
> > > amenable. :)
> > >
> > >
> > > As of now, we have 3 possible writers which can be used in the
> > indexing
> > > topology:
> > >
> > >- Solr
> > >- Elasticsearch
> > >- HDFS
> > >
> > > HDFS is always used, elasticsearch or solr is used depending
> on how
> > you
> > > start the indexing topology.
> > >
> > > A couple of proposals come to mind immediately:
> > >
> > > *Index Filtering*
> > >
> > > You would be able to specify a filter as defined by a stellar
> > statement
> > > (likely a reuse of the StellarFilter that exists in the
> Parsers)
> > which
> > > would allow you to indicate on a message-by-message basis
> whether or
> > not to
> > > write the message.
> > >
> > > The semantics of this would be as follows:
> > >
> > >- Default (i.e. unspecified) is to pass everything through
> (hence
> > >backwards compatible with the current default config).
> > >- Messages which have the associated stellar statement
> evaluate
> > to true
> > >for the 

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley
I’m thinking that CEP (Complex Event Processing) is contrary to the idea of 
silo-ing data per sensor.
Now it’s true that some of those sensors are already aggregating data from 
multiple sources, so maybe I’m wrong here.
But it just seems to me that the “data lake” insights come from being able to 
make decisions over the whole mass of data rather than just vertical slices of 
it.

On 1/12/17, 2:15 PM, "Casey Stella"  wrote:

Hey Matt,

Thanks for the comment!
1. At the moment, we only have one index name, the default of which is the
sensor name but that's entirely up to the user.  This is sensor specific,
so it'd be a separate config for each sensor.  If we want to build multiple
indices per sensor, we'd have to think carefully about how to do that and
would be a bigger undertaking.  I guess I can see the use, though (redirect
messages to one index vs another based on a predicate for a given sensor).
Anyway, not where I was originally thinking that this discussion would go,
but it's an interesting point.

2. I hadn't thought through the implementation quite yet, but we don't
actually have a splitter bolt in that topology, just a spout that goes to
the elasticsearch writer and also to the hdfs writer.

On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:

> Casey, good to have controls like this.  Couple questions:
>
> 1. Regarding the “index” : “squid” name/value pair, is the index name
> expected to always be a sensor name?  Or is the given json structure
> subordinate to a sensor name in zookeeper?  Or can we build arbitrary
> indexes with this new specification, independent of sensor?  Should there
> actually be a list of “indexes”, ie
> { “indexes” : [
> {“index” : “name1”,
> …
> },
> {“index” : “name2”,
> …
> } ]
> }
>
> 2. Would the filtering / writer selection logic take place in the indexing
> topology splitter bolt?  Seems like that would have the smallest impact on
> current implementation, no?
>
> Sorry if these are already answered in PR-415, I haven’t had time to
> review that one yet.
> Thanks,
> --Matt
>
>
> On 1/12/17, 12:55 PM, "Michael Miklavcic" 
> wrote:
>
> I like the flexibility and expressibility of the first option with
> Stellar
> filters.
>
> M
>
> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella 
> wrote:
>
> > As of METRON-652  incubator-metron/pull/415>, we
> > will have decoupled the indexing configuration from the enrichment
> > configuration.  As an immediate follow-up to that, I'd like to
> provide the
> > ability to turn off and on writers via the configs.  I'd like to get
> some
> > community feedback on how the functionality should work, if y'all 
are
> > amenable. :)
> >
> >
> > As of now, we have 3 possible writers which can be used in the
> indexing
> > topology:
> >
> >- Solr
> >- Elasticsearch
> >- HDFS
> >
> > HDFS is always used, elasticsearch or solr is used depending on how
> you
> > start the indexing topology.
> >
> > A couple of proposals come to mind immediately:
> >
> > *Index Filtering*
> >
> > You would be able to specify a filter as defined by a stellar
> statement
> > (likely a reuse of the StellarFilter that exists in the Parsers)
> which
> > would allow you to indicate on a message-by-message basis whether or
> not to
> > write the message.
> >
> > The semantics of this would be as follows:
> >
> >- Default (i.e. unspecified) is to pass everything through (hence
> >backwards compatible with the current default config).
> >- Messages which have the associated stellar statement evaluate
> to true
> >for the writer type will be written, otherwise not.
> >
> >
> > Sample indexing config which would write out no messages to HDFS and
> write
> > out only messages containing a field called "field1":
> > {
> >"index" : "squid"
> >   ,"batchSize" : 100
> >   ,"filters" : {
> >   "HDFS" : "false"
> >  ,"ES" : "exists(field1)"
> >  }
> > }
> >
> > *Index On/Off Switch*
> >
> > A simpler solution would be to just provide a list of writers to
> write
> > messages.  The semantics would be as follows:
> >
> >- If the list is unspecified, then the default is to write all
> messages
> >for every writer in the indexing

Re: [DISCUSS] Dev Guide and Committer Review Guide additions?

2017-01-12 Thread Casey Stella
Regarding 3, Matt, I just started a dev list discussion about configs and
the various components that manage them and how they interact.  Hopefully
we end up in a coherent approach, but in the lead of that, I'd say yes,
valid need for such an architecture.  Please chime in on that thread or
even in reply to this thread (I'll take anything I can get ;) with thoughts.

On Thu, Jan 12, 2017 at 5:49 PM, Matt Foley  wrote:

> I think I hear 3 major areas not adequately covered by our usual “code
> review”:
> 1. Documentation
> 2. Deployment Builds
> 3. Management of config parameters
>
> The other areas mentioned by Otto (testing, perf test, Stellar impact, and
> REST api impact), are entirely valid, but fall under existing code and
> architecture that seems generally adequate.
>
> Regarding #1, Documentation, I’d like to branch a discussion thread for a
> proposal I’m about to make, to enhance our use of README files as usable
> and up-to-date end-user documentation, linked from the Metron site.
> Implicit in that is the idea that we’d deprecate using the cwiki for
> anything but long-lived demonstrations/tutorials that are unlikely to go
> obsolete.
>
> For #2, Deployment Builds:  This is difficult, and unfortunately I’m not
> an expert with these things, but we need to automate this as much as
> possible.  Config params will always interact heavily with deployment
> issues, but let’s leave that for #3 :0)
>
> As far as RPMs, Ansible playbooks, or Docker images go, we’d like to
> automate so that developers never have to do anything when they are
> committing modifications of existing components, and even when new
> components are added (like the Profiler is being added now), it should
> insofar as possible be automated via maven declarations.  But that takes
> input from the experts in each of the areas.
>
> Also, what would people think of dropping Ansible in favor of Ambari and
> Docker as the preferred deployment management approaches?
>
> #3, Management of config parameters:  I’ve been thinking about this
> lately, but haven’t written up a proposal yet.  I’m bothered by the wide
> ranging variability in the way Metron configs are managed: files,
> zookeeper, environment variables, traditional Hadoop-style configs, and
> roll-your-own json configs, sometimes shared, sometimes duplicated, not to
> mention Ambari over it all.  This has been encouraged by the huge number of
> Stack components that Metron depends on, and the relative independence of
> the components Metron itself is composed of.
>
> But I think as Otto points out, as we grow the number of components and
> mature out of the incubator, we have to get this under control.  We need an
> architecture for management of configuration parameters of the Metron
> topologies.  (We can’t do much about the Stack components, but Ambari is
> establishing a culture around managing those.)  The architecture needs to
> include update methodology for semantic changes in parameter sets.
>
> I’m mulling such an architecture, but what do other people think?  Is this
> a valid need?
>
> Thanks,
> --Matt
>
> On 1/12/17, 8:23 AM, "Michael Miklavcic" 
> wrote:
>
> Hi Otto,
>
> You make a great point.
>
> AFA RPM/MPack, we do have some work in the pipeline for streamlining
> things
> a bit with the RPM's and MPack code such that they will be used for
> performing the Metron install in the sandbox VM's rather than Ansible.
> (I'd
> search for the public Jiras and post them here, but Jira is down for
> maintenance currently.) This should help make it obvious that a change
> or
> new feature requires modifications because they will be in the critical
> path to testing.
>
> Documentation is still tricky because we have README files, javadoc,
> and
> the wiki. But in general I think the current approach is to put
> concrete
> functionality docs in the READMEs as much as possible because they can
> be
> tracked and versioned with Git. I think the community has actually been
> doing a pretty good job here. The wiki is a little more tricky because
> there is typically only one version, which tracks master, not
> necessarily
> the latest stable release.
>
> Mike
>
>
> On Thu, Jan 12, 2017 at 8:42 AM, Otto Fowler 
> wrote:
>
> > As Metron evolves to include new deployment options, features, and
> > configurations it is hard and only getting harder for contributors,
> > committers, and reviewers to understand what the required changes are
> > across the different areas of the system to correctly and completely
> > introduce a change or new feature in the system.
> >
> > We have talked some about the requirements or expectations for
> submitters
> > with regards to tests and coverage, coding style, and documentation
> but I
> > don’t think we have enough guidance on deployment or other changes
> that
> > need to be considered.  For committers it is pretty much the sam

Re: [DISCUSS] Dev Guide and Committer Review Guide additions?

2017-01-12 Thread Matt Foley
I think I hear 3 major areas not adequately covered by our usual “code review”:
1. Documentation
2. Deployment Builds
3. Management of config parameters

The other areas mentioned by Otto (testing, perf test, Stellar impact, and REST 
api impact), are entirely valid, but fall under existing code and architecture 
that seems generally adequate.

Regarding #1, Documentation, I’d like to branch a discussion thread for a 
proposal I’m about to make, to enhance our use of README files as usable and 
up-to-date end-user documentation, linked from the Metron site.  Implicit in 
that is the idea that we’d deprecate using the cwiki for anything but 
long-lived demonstrations/tutorials that are unlikely to go obsolete.

For #2, Deployment Builds:  This is difficult, and unfortunately I’m not an 
expert with these things, but we need to automate this as much as possible.  
Config params will always interact heavily with deployment issues, but let’s 
leave that for #3 :0)

As far as RPMs, Ansible playbooks, or Docker images go, we’d like to automate 
so that developers never have to do anything when they are committing 
modifications of existing components, and even when new components are added 
(like the Profiler is being added now), it should insofar as possible be 
automated via maven declarations.  But that takes input from the experts in 
each of the areas.  

Also, what would people think of dropping Ansible in favor of Ambari and Docker 
as the preferred deployment management approaches?

#3, Management of config parameters:  I’ve been thinking about this lately, but 
haven’t written up a proposal yet.  I’m bothered by the wide ranging 
variability in the way Metron configs are managed: files, zookeeper, 
environment variables, traditional Hadoop-style configs, and roll-your-own json 
configs, sometimes shared, sometimes duplicated, not to mention Ambari over it 
all.  This has been encouraged by the huge number of Stack components that 
Metron depends on, and the relative independence of the components Metron 
itself is composed of.

But I think as Otto points out, as we grow the number of components and mature 
out of the incubator, we have to get this under control.  We need an 
architecture for management of configuration parameters of the Metron 
topologies.  (We can’t do much about the Stack components, but Ambari is 
establishing a culture around managing those.)  The architecture needs to 
include update methodology for semantic changes in parameter sets.

I’m mulling such an architecture, but what do other people think?  Is this a 
valid need?

Thanks,
--Matt

On 1/12/17, 8:23 AM, "Michael Miklavcic"  wrote:

Hi Otto,

You make a great point.

AFA RPM/MPack, we do have some work in the pipeline for streamlining things
a bit with the RPM's and MPack code such that they will be used for
performing the Metron install in the sandbox VM's rather than Ansible. (I'd
search for the public Jiras and post them here, but Jira is down for
maintenance currently.) This should help make it obvious that a change or
new feature requires modifications because they will be in the critical
path to testing.

Documentation is still tricky because we have README files, javadoc, and
the wiki. But in general I think the current approach is to put concrete
functionality docs in the READMEs as much as possible because they can be
tracked and versioned with Git. I think the community has actually been
doing a pretty good job here. The wiki is a little more tricky because
there is typically only one version, which tracks master, not necessarily
the latest stable release.

Mike


On Thu, Jan 12, 2017 at 8:42 AM, Otto Fowler 
wrote:

> As Metron evolves to include new deployment options, features, and
> configurations it is hard and only getting harder for contributors,
> committers, and reviewers to understand what the required changes are
> across the different areas of the system to correctly and completely
> introduce a change or new feature in the system.
>
> We have talked some about the requirements or expectations for submitters
> with regards to tests and coverage, coding style, and documentation  but I
> don’t think we have enough guidance on deployment or other changes that
> need to be considered.  For committers it is pretty much the same, with 
the
> extra stuff around that process.
>
> Right now it seems as a committer I’m counting on others like Nick or 
Casey
> to understand anything that may be missing from a submission when I review
> it.  Should there be an Ambari/RPM change?   Does this change the RestAPI?
> Does this affect STELLAR Lang/SHELL?  Does it need custom Docker Compose
> work?  etc etc.
>
> I think as we grow the community and try to get out of incubation it will
> be impractical for us to count on this, and we are ev

[GitHub] incubator-metron issue #316: METRON-503: Metron REST API

2017-01-12 Thread jjmeyer0
Github user jjmeyer0 commented on the issue:

https://github.com/apache/incubator-metron/pull/316
  
@merrimanr, my reasoning for including the HTTP status code in the error 
message format is that there may be a need to send these messages to a 
downstream system. For example, say a developer integrating with the Metron API 
wants to drop all error responses on a queue for processing by another system; 
if the status code weren't part of the error format, some context would be 
lost. But to your point, maybe it doesn't buy us much. I don't have a strong 
preference either way on this one. However, at some point it may be worth 
having a custom attribute called `code` that would allow a user to look up the 
error in the documentation, which could show things like common causes and 
workarounds. That sounds like a separate PR with a lot of discussion around it, 
though.
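
For illustration only -- none of these field names come from the actual REST 
API -- the kind of payload being discussed might look like:

```python
# Purely hypothetical error payload; every field name here is an assumption.
error_response = {
    "responseCode": 500,          # HTTP status kept in the body so it survives re-queuing
    "message": "Unable to load sensor config",
    "code": "METRON-REST-0042",   # possible future key for looking the error up in the docs
}
```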




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella
Hey Matt,

Thanks for the comment!
1. At the moment, we only have one index name, the default of which is the
sensor name but that's entirely up to the user.  This is sensor specific,
so it'd be a separate config for each sensor.  If we want to build multiple
indices per sensor, we'd have to think carefully about how to do that and
would be a bigger undertaking.  I guess I can see the use, though (redirect
messages to one index vs another based on a predicate for a given sensor).
Anyway, not where I was originally thinking that this discussion would go,
but it's an interesting point.

2. I hadn't thought through the implementation quite yet, but we don't
actually have a splitter bolt in that topology, just a spout that goes to
the elasticsearch writer and also to the hdfs writer.

On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:

> Casey, good to have controls like this.  Couple questions:
>
> 1. Regarding the “index” : “squid” name/value pair, is the index name
> expected to always be a sensor name?  Or is the given json structure
> subordinate to a sensor name in zookeeper?  Or can we build arbitrary
> indexes with this new specification, independent of sensor?  Should there
> actually be a list of “indexes”, ie
> { “indexes” : [
> {“index” : “name1”,
> …
> },
> {“index” : “name2”,
> …
> } ]
> }
>
> 2. Would the filtering / writer selection logic take place in the indexing
> topology splitter bolt?  Seems like that would have the smallest impact on
> current implementation, no?
>
> Sorry if these are already answered in PR-415, I haven’t had time to
> review that one yet.
> Thanks,
> --Matt
>
>
> On 1/12/17, 12:55 PM, "Michael Miklavcic" 
> wrote:
>
> I like the flexibility and expressibility of the first option with
> Stellar
> filters.
>
> M
>
> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella 
> wrote:
>
> > As of METRON-652  incubator-metron/pull/415>, we
> > will have decoupled the indexing configuration from the enrichment
> > configuration.  As an immediate follow-up to that, I'd like to
> provide the
> > ability to turn off and on writers via the configs.  I'd like to get
> some
> > community feedback on how the functionality should work, if y'all are
> > amenable. :)
> >
> >
> > As of now, we have 3 possible writers which can be used in the
> indexing
> > topology:
> >
> >- Solr
> >- Elasticsearch
> >- HDFS
> >
> > HDFS is always used, elasticsearch or solr is used depending on how
> you
> > start the indexing topology.
> >
> > A couple of proposals come to mind immediately:
> >
> > *Index Filtering*
> >
> > You would be able to specify a filter as defined by a stellar
> statement
> > (likely a reuse of the StellarFilter that exists in the Parsers)
> which
> > would allow you to indicate on a message-by-message basis whether or
> not to
> > write the message.
> >
> > The semantics of this would be as follows:
> >
> >- Default (i.e. unspecified) is to pass everything through (hence
> >backwards compatible with the current default config).
> >- Messages which have the associated stellar statement evaluate
> to true
> >for the writer type will be written, otherwise not.
> >
> >
> > Sample indexing config which would write out no messages to HDFS and
> write
> > out only messages containing a field called "field1":
> > {
> >"index" : "squid"
> >   ,"batchSize" : 100
> >   ,"filters" : {
> >   "HDFS" : "false"
> >  ,"ES" : "exists(field1)"
> >  }
> > }
> >
> > *Index On/Off Switch*
> >
> > A simpler solution would be to just provide a list of writers to
> write
> > messages.  The semantics would be as follows:
> >
> >- If the list is unspecified, then the default is to write all
> messages
> >for every writer in the indexing topology
> >- If the list is specified, then a writer will write all messages
> if and
> >only if it is named in the list.
> >
> > Sample indexing config which turns off HDFS and keeps on
> Elasticsearch:
> > {
> >"index" : "squid"
> >   ,"batchSize" : 100
> >   ,"writers" : [ "ES" ]
> > }
> >
> > Thanks in advance for the feedback!  Also, if you have any other,
> better
> > ideas than the ones presented here, let me know too.
> >
> > Best,
> >
> > Casey
> >
>
>
>
>
>


[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread mmiklavc
Github user mmiklavc commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@dlyle65535 It definitely does that - I was referring to our json config 
mainly.




[DISCUSS] Ambari Metron Configuration Management consequences and call to action

2017-01-12 Thread Casey Stella
In the course of discussion on the PR for METRON-652
 something that I
should definitely have understood better came to light and I thought that
it was worth bringing to the attention of the community to get
clarification/discuss is just how we manage configs.

Currently (assuming the management UI that Ryan Merriman submitted) configs
are managed/adjusted via a couple of different mechanism.

   - zk_load_utils.sh: pushed and pulled from disk to zookeeper
   - Stellar REPL: pushed and pulled via the CONFIG_GET/CONFIG_PUT functions
   - Ambari: initialized via the zk_load_utils script and then some of them
   are managed directly (global config) and some indirectly (sensor-specific
   configs).
  - NOTE: Upon service restart, it may or may not overwrite changes on
  disk or on zookeeper.  *Can someone more knowledgeable than me about
  this describe precisely the semantics that we can expect on
service restart
  for Ambari? What gets overwritten on disk and what gets updated
in ambari?*
   - The Management UI: manages some of the configs. *RYAN: Which configs
   do we support here and which don't we support here?*

As you can see, we have a mishmash of mechanisms to update and manage the
configuration for Metron in zookeeper.  In the beginning the approach was
just to edit configs on disk and push/pull them via zk_load_utils.  Configs
could be historically managed using source control, etc.  As we got more
and more components managing the configs, we haven't taken care that they
all work with each other in an expected way (I believe these are
true... correct me if I'm wrong):

   - If configs are modified in the management UI or the Stellar REPL and
   someone forgets to pull the configs from zookeeper to disk before they do
   a push via zk_load_utils, they will clobber the configs in zookeeper with
   old configs.
   - If the global config is changed on disk and the ambari service
   restarts, it'll get reset with the original global config.
   - *Ryan, in the management UI, if someone changes the zookeeper configs
   from outside, are those configs reflected immediately in the UI?*


It seems to me that we have a couple of options here:

   - A service that intermediates config update/retrieval and tracks historical
   changes, so these different mechanisms can share a common component for
   config management/tracking; the existing mechanisms would be refactored to
   use that service
   - Standardize on exactly one component to manage the configs and regress
   the others (that's a verb, right?   nicer than delete.)

I happen to like the service approach, myself, but I wanted to put it up
for discussion and hopefully someone will volunteer to design such a thing.
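
To give that option a concrete shape for the discussion, a very rough sketch of
an intermediating interface might look like the following; the class name,
in-memory storage, and history format are assumptions, not a design:

```python
# Very rough sketch of an intermediating config service, purely to frame the
# discussion -- names, in-memory storage, and history format are assumptions.
import json
import time

class ConfigService:
    """Single choke point for config reads/writes with a simple change history."""

    def __init__(self):
        self._configs = {}   # e.g. {"global": {...}, "sensor/bro": {...}}
        self._history = []   # append-only audit trail

    def get(self, name):
        return self._configs.get(name)

    def put(self, name, value, actor):
        # Every writer (Ambari, management UI, Stellar REPL, the CLI script)
        # would go through here, so changes stay ordered and attributable.
        self._history.append({
            "name": name,
            "actor": actor,
            "timestamp": time.time(),
            "previous": json.dumps(self._configs.get(name)),
        })
        self._configs[name] = value

    def history(self, name):
        return [entry for entry in self._history if entry["name"] == name]
```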

To frame the debate, I want us to keep in mind a couple of things that may
or may not be relevant to the discussion:

   - We will eventually be moving to support kerberos so there should at
   least be a path to use kerberos for any solution IMO
   - There is value in each of the different mechanisms in place now.  If
   there weren't, then they wouldn't have been created.  Before we try to make
   this a "there can be only one" argument, I'd like to hear very good
   arguments.

Finally, I'd appreciate if some people might answer the questions I have in
bold there.  Hopefully this discussion, if nothing else happens, will
result in fodder for proper documentation of the ins and outs of each of
the components bulleted above.

Best,

Casey


[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread dlyle65535
Github user dlyle65535 commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@mmiklavc - doesn't Enrichment Master's config do the same thing for 
enrichment.properties?


File(format("{metron_config_path}/enrichment.properties"),
 content=Template("enrichment.properties.j2"),
 owner=params.metron_user,
 group=params.metron_group
 )

And Indexing Master handles elasticsearch.properties (and global.json)?


File("{0}/global.json".format(params.metron_zookeeper_config_path),
 owner=params.metron_user,
 content=InlineTemplate(params.global_json_template)
 )


File("{0}/elasticsearch.properties".format(params.metron_zookeeper_config_path 
+ '/..'),
 owner=params.metron_user,
 content=InlineTemplate(params.global_properties_template))`

That was my intention anyway. Am I mistaken?




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley
Casey, good to have controls like this.  Couple questions:

1. Regarding the “index” : “squid” name/value pair, is the index name expected 
to always be a sensor name?  Or is the given json structure subordinate to a 
sensor name in zookeeper?  Or can we build arbitrary indexes with this new 
specification, independent of sensor?  Should there actually be a list of 
“indexes”, ie
{ “indexes” : [
{“index” : “name1”,
…
},
{“index” : “name2”,
…
} ]
}

2. Would the filtering / writer selection logic take place in the indexing 
topology splitter bolt?  Seems like that would have the smallest impact on 
current implementation, no?

Sorry if these are already answered in PR-415, I haven’t had time to review 
that one yet.
Thanks,
--Matt


On 1/12/17, 12:55 PM, "Michael Miklavcic"  wrote:

I like the flexibility and expressibility of the first option with Stellar
filters.

M

On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella  wrote:

> As of METRON-652 , we
> will have decoupled the indexing configuration from the enrichment
> configuration.  As an immediate follow-up to that, I'd like to provide the
> ability to turn off and on writers via the configs.  I'd like to get some
> community feedback on how the functionality should work, if y'all are
> amenable. :)
>
>
> As of now, we have 3 possible writers which can be used in the indexing
> topology:
>
>- Solr
>- Elasticsearch
>- HDFS
>
> HDFS is always used, elasticsearch or solr is used depending on how you
> start the indexing topology.
>
> A couple of proposals come to mind immediately:
>
> *Index Filtering*
>
> You would be able to specify a filter as defined by a stellar statement
> (likely a reuse of the StellarFilter that exists in the Parsers) which
> would allow you to indicate on a message-by-message basis whether or not 
to
> write the message.
>
> The semantics of this would be as follows:
>
>- Default (i.e. unspecified) is to pass everything through (hence
>backwards compatible with the current default config).
>- Messages which have the associated stellar statement evaluate to true
>for the writer type will be written, otherwise not.
>
>
> Sample indexing config which would write out no messages to HDFS and write
> out only messages containing a field called "field1":
> {
>"index" : "squid"
>   ,"batchSize" : 100
>   ,"filters" : {
>   "HDFS" : "false"
>  ,"ES" : "exists(field1)"
>  }
> }
>
> *Index On/Off Switch*
>
> A simpler solution would be to just provide a list of writers to write
> messages.  The semantics would be as follows:
>
>- If the list is unspecified, then the default is to write all messages
>for every writer in the indexing topology
>- If the list is specified, then a writer will write all messages if 
and
>only if it is named in the list.
>
> Sample indexing config which turns off HDFS and keeps on Elasticsearch:
> {
>"index" : "squid"
>   ,"batchSize" : 100
>   ,"writers" : [ "ES" ]
> }
>
> Thanks in advance for the feedback!  Also, if you have any other, better
> ideas than the ones presented here, let me know too.
>
> Best,
>
> Casey
>






[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread mmiklavc
Github user mmiklavc commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@cestella Yeah, if we do a CONFIG_PUT via the Stellar REPL without updating 
the local file system config copy, that's going to cause syncing problems.




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread mmiklavc
Github user mmiklavc commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
OK, found this for the global config in metron-env.xml

```

<property>
    <name>global-json</name>
    <display-name>global.json template</display-name>
    <description>This is the jinja template for global.json file</description>
    <value>
{
"es.clustername": "{{ es_cluster_name }}",
"es.ip": "{{ es_url }}",
"es.date.format": ".MM.dd.HH"
}
    </value>
    <value-attributes>
        <type>content</type>
    </value-attributes>
</property>


```
This is referenced in params_linux.py
```
global_json_template = config['configurations']['metron-env']['global-json']
```
And then it's used by metron_service.py to lay down the config using jinja 
templates (edited for brevity):
```
def init_config():
...
Execute(ambari_format(
"{metron_home}/bin/zk_load_configs.sh --mode PUSH -i 
{metron_zookeeper_config_path} -z {zookeeper_quorum}"),
path=ambari_format("{java_home}/bin")
)
...
def load_global_config(params):
...
File("{0}/global.json".format(params.metron_zookeeper_config_path),
 owner=params.metron_user,
 content=InlineTemplate(params.global_json_template)
 )
...
init_config()
```
So yes, if you change global.json external to Ambari, ie in the Metron 
install config directory, Ambari will rewrite what's in the local FS, and 
follow up with a load to ZK. As best I can tell, this is _only_ applicable to 
the global config, not the individual topology json configs. Those are unpacked 
on install via Ambari performing an RPM install, but not actually managed on an 
ongoing basis. The way you get into hot water here is if you've chosen to 
manage Ambari configs in a different directory than where Ambari believes 
they're located, due to it using zk_load_configs.sh under the hood. Even then, 
this is parameterized via `metron_zookeeper_config_path`. I can't recall if 
that path is absolute, relative, or both. But in metron-env it's defaulted to 
`config/zookeeper`

Hope this clarifies some things.




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
Ok, thanks @dlyle65535 I don't think it's a regression (since it doesn't 
appear to be any different behavior-wise between when it existed in enrichment 
vs in the new configs), but I do think that we should get a broader discussion 
of ambari management in light of the new management UI that @merrimanr 
submitted and, for that matter, the stellar REPL's `CONFIG_GET` and 
`CONFIG_PUT` functions.  I'll kick off a dev list discussion to try to figure 
out what to do about it.




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread dlyle65535
Github user dlyle65535 commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
Some yes and some no. I'm a bit bleary-eyed today, so I'm not sure if this 
is the complete list- here's my current understanding. Ambari actively (will 
overwrite) manages global.json, enrichment.properties and 
elasticsearch.properties. It passively (will lay down the default from the 
rpms) manages the others.




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@dlyle65535 I think that's a great point and thanks for clarifying.  
Looking at that file, it appears that ambari is explicitly providing the 
ability to modify the global config and certain flux topology properties files, 
but not a screen to manage the sensor-specific configuration content (i.e. at 
present parsers and enrichment).  Please correct if I have had a reading 
comprehension SNAFU ;)  If so, since we just added a new set of configs under 
the same config directory and provided hooks for `zk_load_utils.sh` to know how 
to load and get them, it shouldn't be any different than the existing configs.

That being said, it is a good point you make.  Let me restate it and you 
tell me if I have the gist.  If people change the configs on an m-pack 
installed cluster via the CLI (via `zk_load_utils.sh`) or the 
soon-to-be-committed management UI, ambari will revert those changes on service 
restart, correct?




Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Michael Miklavcic
I like the flexibility and expressibility of the first option with Stellar
filters.

M

On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella  wrote:

> As of METRON-652 , we
> will have decoupled the indexing configuration from the enrichment
> configuration.  As an immediate follow-up to that, I'd like to provide the
> ability to turn off and on writers via the configs.  I'd like to get some
> community feedback on how the functionality should work, if y'all are
> amenable. :)
>
>
> As of now, we have 3 possible writers which can be used in the indexing
> topology:
>
>- Solr
>- Elasticsearch
>- HDFS
>
> HDFS is always used, elasticsearch or solr is used depending on how you
> start the indexing topology.
>
> A couple of proposals come to mind immediately:
>
> *Index Filtering*
>
> You would be able to specify a filter as defined by a stellar statement
> (likely a reuse of the StellarFilter that exists in the Parsers) which
> would allow you to indicate on a message-by-message basis whether or not to
> write the message.
>
> The semantics of this would be as follows:
>
>- Default (i.e. unspecified) is to pass everything through (hence
>backwards compatible with the current default config).
>    - Messages for which the associated Stellar statement evaluates to true
>    for the writer type will be written; otherwise they will not.
>
>
> Sample indexing config which would write out no messages to HDFS and write
> out only messages containing a field called "field1":
> {
>"index" : "squid"
>   ,"batchSize" : 100
>   ,"filters" : {
>   "HDFS" : "false"
>  ,"ES" : "exists(field1)"
>  }
> }
>
> *Index On/Off Switch*
>
> A simpler solution would be to just provide a list of writers to write
> messages.  The semantics would be as follows:
>
>- If the list is unspecified, then the default is to write all messages
>for every writer in the indexing topology
>- If the list is specified, then a writer will write all messages if and
>only if it is named in the list.
>
> Sample indexing config which turns off HDFS and keeps on Elasticsearch:
> {
>"index" : "squid"
>   ,"batchSize" : 100
>   ,"writers" : [ "ES" ]
> }
>
> Thanks in advance for the feedback!  Also, if you have any other, better
> ideas than the ones presented here, let me know too.
>
> Best,
>
> Casey
>


[DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella
As of METRON-652 , we
will have decoupled the indexing configuration from the enrichment
configuration.  As an immediate follow-up to that, I'd like to provide the
ability to turn off and on writers via the configs.  I'd like to get some
community feedback on how the functionality should work, if y'all are
amenable. :)


As of now, we have 3 possible writers which can be used in the indexing
topology:

   - Solr
   - Elasticsearch
   - HDFS

HDFS is always used; Elasticsearch or Solr is used depending on how you
start the indexing topology.

A couple of proposals come to mind immediately:

*Index Filtering*

You would be able to specify a filter as defined by a Stellar statement
(likely a reuse of the StellarFilter that exists in the Parsers) which
would allow you to indicate on a message-by-message basis whether or not to
write the message.

The semantics of this would be as follows:

   - Default (i.e. unspecified) is to pass everything through (hence
   backwards compatible with the current default config).
   - Messages for which the associated Stellar statement evaluates to true
   for the writer type will be written; otherwise they will not.


Sample indexing config which would write out no messages to HDFS and write
out only messages containing a field called "field1":
{
   "index" : "squid"
  ,"batchSize" : 100
  ,"filters" : {
  "HDFS" : "false"
 ,"ES" : "exists(field1)"
 }
}

*Index On/Off Switch*

A simpler solution would be to just provide a list of writers to write
messages.  The semantics would be as follows:

   - If the list is unspecified, then the default is to write all messages
   for every writer in the indexing topology
   - If the list is specified, then a writer will write all messages if and
   only if it is named in the list.

Sample indexing config which turns off HDFS and keeps on Elasticsearch:
{
   "index" : "squid"
  ,"batchSize" : 100
  ,"writers" : [ "ES" ]
}
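
As a purely illustrative sketch of the filtering option (the field name and
Stellar expression below are only examples, not a proposal), you could keep
everything in HDFS for the data lake while only sending alerts to
Elasticsearch:
{
   "index" : "squid"
  ,"batchSize" : 100
  ,"filters" : {
      "HDFS" : "true"
     ,"ES" : "exists(is_alert) and is_alert"
     }
}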

Thanks in advance for the feedback!  Also, if you have any other, better
ideas than the ones presented here, let me know too.

Best,

Casey


[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread dlyle65535
Github user dlyle65535 commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@merrimanr, @cestella - wrt management of configs from Ambari, the answer 
is kinda. Sensible default configurations are pushed with certain 
user-specified changes that allow the system to function. A complete list can 
be found by looking at metron-env.xml.

There is an EXTREMELY important side-effect to this that must not be 
forgotten. Any configuration that has even partial management by Ambari must 
not be modified outside of Ambari if one expects those changes to survive a 
service restart. Ambari will detect a change to the file and overwrite it.

So, I don't know for sure, but I suspect this PR will introduce breaking 
changes to the MPack install which will be corrected during the work on 
METRON-653.




[GitHub] incubator-metron issue #404: METRON-624: Updated Comparison/Equality Evaluat...

2017-01-12 Thread jjmeyer0
Github user jjmeyer0 commented on the issue:

https://github.com/apache/incubator-metron/pull/404
  
Thanks @cestella. I appreciate it. If no one has already, I'll work on 
those functions we discussed next.




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
Testing Instructions beyond the normal smoke test (i.e. letting data
flow through to the indices and checking them).

## Preliminaries

Since I will use the squid topology to pass data through in a controlled
way, we must install squid and generate one point of data:
* `yum install -y squid`
* `service squid start`
* `squidclient http://www.yahoo.com`

Also, set an environment variable to indicate `METRON_HOME`:
* `export METRON_HOME=/usr/metron/0.3.0` 

## Free Up Space on the virtual machine

First, let's free up some headroom on the virtual machine.  If you are 
running this on a
multinode cluster, you would not have to do this.
* Kill monit via `service monit stop`
* Kill tcpreplay via `for i in $(ps -ef | grep tcpreplay | awk '{print 
$2}');do kill -9 $i;done`
* Kill existing parser topologies via 
   * `storm kill snort`
   * `storm kill bro`
* Kill flume via `for i in $(ps -ef | grep flume | awk '{print $2}');do 
kill -9 $i;done`
* Kill yaf via `for i in $(ps -ef | grep yaf | awk '{print $2}');do kill -9 
$i;done`
* Kill bro via `for i in $(ps -ef | grep bro | awk '{print $2}');do kill -9 
$i;done`

## Deploy the squid parser
* Create the squid kafka topic: 
`/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper node1:2181 
--create --topic squid --partitions 1 --replication-factor 1`
* Start via `$METRON_HOME/bin/start_parser_topology.sh -k node1:6667 -z 
node1:2181 -s squid`

### Test Case 1: Adjusting batch sizes
* Delete any squid index that currently exists (if any do).
* Create a file at `$METRON_HOME/config/zookeeper/indexing/squid.json` with 
the following contents:
```
{
  "index" : "squid",
  "batchSize" : 5
}
```
* Send 4 data points through and ensure that there are no data points in 
the index:
  * `cat /var/log/squid/access.log /var/log/squid/access.log 
/var/log/squid/access.log /var/log/squid/access.log | 
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list 
node1:6667 --topic squid`
  * `curl "http://localhost:9200/squid*/_search?pretty=true&q=*:*"; 2> 
/dev/null| grep "full_hostname" | wc -l` should yield  `0` 
* Send a final data point through and ensure that we have 5 data points:
  * `cat /var/log/squid/access.log | 
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list 
node1:6667 --topic squid`
  * `curl "http://localhost:9200/squid*/_search?pretty=true&q=*:*"; 2> 
/dev/null| grep "full_hostname" | wc -l` should yield  `5` 
 
### Test Case 2: Update configs from the CLI
* Edit the file at `$METRON_HOME/config/zookeeper/indexing/squid.json` to 
the following contents:
```
{
  "index" : "squid",
  "batchSize" : 10
}
```
* Push the configs: `$METRON_HOME/bin/zk_load_configs.sh -m PUSH -i 
$METRON_HOME/config/zookeeper -z node1:2181`
* Dump the configs and verify the squid indexing config is correct: 
`$METRON_HOME/bin/zk_load_configs.sh -m DUMP -z node1:2181`

### Test Case 3: Stellar Management Functions
* Execute the following in the stellar shell:
```
Stellar, Go!
Please note that functions are loading lazily in the background and will
be unavailable until loaded fully.
{es.clustername=metron, es.ip=node1, es.port=9300,
es.date.format=.MM.dd.HH}
[Stellar]>>> # Grab the indexing config
[Stellar]>>> squid_config := CONFIG_GET('INDEXING', 'squid', true)
Functions loaded, you may refer to functions now...
[Stellar]>>> # Update the index and batch size
[Stellar]>>> squid_config := INDEXING_SET_BATCH( 
INDEXING_SET_INDEX(squid_config, 'squid'), 1)
[Stellar]>>> # Push the config to zookeeper
[Stellar]>>> CONFIG_PUT('INDEXING', squid_config, 'squid')
[Stellar]>>> # Grab the updated config from zookeeper
[Stellar]>>> CONFIG_GET('INDEXING', 'squid')
{
  "index" : "squid",
  "batchSize" : 1
}
```
* Confirm that the dump command from `$METRON_HOME/bin/zk_load_configs.sh 
-m DUMP -z node1:2181` contains the config with batch size of `1`
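
As an extra sanity check (not part of the original instructions), you could
send one more record through and confirm it is indexed immediately now that the
batch size is `1`, reusing the commands from Test Case 1:
```
cat /var/log/squid/access.log | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --broker-list node1:6667 --topic squid
curl "http://localhost:9200/squid*/_search?pretty=true&q=*:*" 2> /dev/null | grep "full_hostname" | wc -l
```
The count reported by the `curl` should go up by one without waiting for a
batch to fill.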





[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@merrimanr I was under the impression that there was a panel to manage the 
enrichment configurations per sensor in Ambari, but if that's not the case and 
it's just the initial load, then I don't see a regression.  @dlyle65535 does this 
sound right to you?




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
What is the regression we're talking about?  Are we talking about loading 
the indexing configs initially with the MPack, or about managing the configs in 
Ambari?

I don't believe we are currently managing parser or enrichment configs in 
Ambari, and indexing configs would also fall into that category since they are 
part of the enrichment configs now.  Or is that incorrect?




Re: [DISCUSS] Dev Guide and Committer Review Guide additions?

2017-01-12 Thread Michael Miklavcic
Hi Otto,

You make a great point.

As far as the RPM/MPack goes, we do have some work in the pipeline for
streamlining things a bit with the RPMs and MPack code such that they will be
used for performing the Metron install in the sandbox VMs rather than Ansible.
(I'd search for the public Jiras and post them here, but Jira is down for
maintenance currently.) This should help make it obvious when a change or
new feature requires modifications to them, because they will be in the
critical path to testing.

Documentation is still tricky because we have README files, javadoc, and
the wiki. But in general I think the current approach is to put concrete
functionality docs in the READMEs as much as possible because they can be
tracked and versioned with Git. I think the community has actually been
doing a pretty good job here. The wiki is a little trickier because
there is typically only one version, which tracks master, not necessarily
the latest stable release.

Mike


On Thu, Jan 12, 2017 at 8:42 AM, Otto Fowler 
wrote:

> As Metron evolves to include new deployment options, features, and
> configurations it is hard and only getting harder for contributors,
> committers, and reviewers to understand what the required changes are
> across the different areas of the system to correctly and completely
> introduce a change or new feature in the system.
>
> We have talked some about the requirements or expectations for submitters
> with regards to tests and coverage, coding style, and documentation, but I
> don’t think we have enough guidance on deployment or other changes that
> need to be considered.  For committers it is pretty much the same, with the
> extra stuff around that process.
>
> Right now it seems as a committer I’m counting on others like Nick or Casey
> to understand anything that may be missing from a submission when I review
> it.  Should there be an Ambari/RPM change?  Does this change the REST API?
> Does this affect STELLAR Lang/SHELL?  Does it need custom Docker Compose
> work?  etc., etc.
>
> I think as we grow the community and try to get out of incubation it will
> be impractical for us to count on this, and we are even now increasing the
> risk of regression or functional gaps ( unrealized ) that will have an
> adverse effect on having a stable master.
>
> I think we should discuss if and how we can improve this or the issue of my
> sanity ;).
>
> What are the criteria that we need submitters and reviewers to have in
> mind?
> * Test
> * Doc
> ** Obsoleting of existing documentation/how-to’s ( even hortonworks posts )
> * Performance
> ** How do we test for performance?
> *** Standards
> *** Tools and processes
> * Deployment
> ** RPM
> ** Docker
> ** Ansible
> ** Ambari
> ** AWS Script
> * Functional
> ** STELLAR/Shell
> ** REST APIs
> * Dev/review guide
> ** Does the review / submit guide need to account for it?
>
> Any thoughts?
>


[GitHub] incubator-metron pull request #400: METRON-636: Capture memory and cpu detai...

2017-01-12 Thread ottobackwards
Github user ottobackwards commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/400#discussion_r95818504
  
--- Diff: metron-deployment/scripts/platform-info.sh ---
@@ -62,3 +62,39 @@ mvn --version
 # operating system
 echo "--"
 uname -a
+
+# system resources
+echo "--"
+case "${OSTYPE}" in
+  linux*)
+cat /proc/meminfo  | grep -i MemTotal | awk '{print "Total System 
Memory = " $2/1024 " MB"}'
+cat /proc/cpuinfo | egrep 'model\ name' | uniq | cut -d: -f2 | awk 
'{print "Processor Model:" $0}'
+cat /proc/cpuinfo | egrep 'cpu\ MHz' | uniq | cut -d: -f2 | awk 
'{print "Processor Speed:" $0 " MHz"}'
+cat /proc/cpuinfo | grep -i '^processor' | wc -l | awk '{print "Total 
Physical Processors: " $0}'
+cat /proc/cpuinfo | grep -i cores | cut -d: -f2 | awk '{corecount+=$1} 
END {print "Total cores: " corecount}'
+echo "Disk information:"
+df -h | grep "^/" 
+;;
+  darwin*)
+sysctl hw.memsize | awk '{print "Total System Memory = " $2/1048576 " 
MB"}'
+sysctl machdep.cpu | grep 'machdep.cpu.brand_string' | cut -d: -f2 | 
cut -d\@ -f1 | awk '{print "Processor Model:" $0}'
+sysctl machdep.cpu | grep 'machdep.cpu.brand_string' | cut -d: -f2 | 
cut -d\@ -f2 | awk '{print "Processor Speed:" $0}'
+sysctl hw.physicalcpu | cut -d: -f2 | awk '{print "Total Physical 
Processors:" $0}'
+sysctl machdep.cpu | grep 'machdep.cpu.core_count' | cut -d: -f2 | cut 
-d\@ -f2 | awk '{print "Total cores:" $0}'
+echo "Disk information:"
+df -h | grep "^/" 
+;;
+  bsd*)
+echo "BSD is not currently supported, unable to detect system 
resources"
+;;
--- End diff --

What the script supports and what Metron supports can be different things; 
bugs can be logged for improving the script.




[DISCUSS] Dev Guide and Committer Review Guide additions?

2017-01-12 Thread Otto Fowler
As Metron evolves to include new deployment options, features, and
configurations it is hard and only getting harder for contributors,
committers, and reviewers to understand what the required changes are
across the different areas of the system to correctly and completely
introduce a change or new feature in the system.

We have talked some about the requirements or expectations for submitters
with regards to tests and coverage, coding style, and documentation, but I
don’t think we have enough guidance on deployment or other changes that
need to be considered.  For committers it is pretty much the same, with the
extra stuff around that process.

Right now it seems as a committer I’m counting on others like Nick or Casey
to understand anything that may be missing from a submission when I review
it.  Should there be an Ambari/RPM change?  Does this change the REST API?
Does this affect STELLAR Lang/SHELL?  Does it need custom Docker Compose
work?  etc., etc.

I think as we grow the community and try to get out of incubation it will
be impractical for us to count on this, and we are even now increasing the
risk of regression or functional gaps ( unrealized ) that will have an
adverse effect on having a stable master.

I think we should discuss if and how we can improve this or the issue of my
sanity ;).

What are the criteria that we need submitters and reviewers to have in
mind?
* Test
* Doc
** Obsoleting of existing documentation/how-to’s ( even hortonworks posts )
* Performance
** How do we test for performance?
*** Standards
*** Tools and processes
* Deployment
** RPM
** Docker
** Ansible
** Ambari
** AWS Script
* Functional
** STELLAR/Shell
** REST APIs
* Dev/review guide
** Does the review / submit guide need to account for it?

Any thoughts?


[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@ottobackwards yeah, agreed.  It's a fairly complex situation at the moment 
and not documented very well.  That might be worth a discussion on the dev list.




[GitHub] incubator-metron pull request #400: METRON-636: Capture memory and cpu detai...

2017-01-12 Thread JonZeolla
Github user JonZeolla commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/400#discussion_r95813473
  
--- Diff: metron-deployment/scripts/platform-info.sh ---
@@ -62,3 +62,39 @@ mvn --version
 # operating system
 echo "--"
 uname -a
+
+# system resources
+echo "--"
+case "${OSTYPE}" in
+  linux*)
+cat /proc/meminfo  | grep -i MemTotal | awk '{print "Total System 
Memory = " $2/1024 " MB"}'
+cat /proc/cpuinfo | egrep 'model\ name' | uniq | cut -d: -f2 | awk 
'{print "Processor Model:" $0}'
+cat /proc/cpuinfo | egrep 'cpu\ MHz' | uniq | cut -d: -f2 | awk 
'{print "Processor Speed:" $0 " MHz"}'
+cat /proc/cpuinfo | grep -i '^processor' | wc -l | awk '{print "Total 
Physical Processors: " $0}'
+cat /proc/cpuinfo | grep -i cores | cut -d: -f2 | awk '{corecount+=$1} 
END {print "Total cores: " corecount}'
+echo "Disk information:"
+df -h | grep "^/" 
+;;
+  darwin*)
+sysctl hw.memsize | awk '{print "Total System Memory = " $2/1048576 " 
MB"}'
+sysctl machdep.cpu | grep 'machdep.cpu.brand_string' | cut -d: -f2 | 
cut -d\@ -f1 | awk '{print "Processor Model:" $0}'
+sysctl machdep.cpu | grep 'machdep.cpu.brand_string' | cut -d: -f2 | 
cut -d\@ -f2 | awk '{print "Processor Speed:" $0}'
+sysctl hw.physicalcpu | cut -d: -f2 | awk '{print "Total Physical 
Processors:" $0}'
+sysctl machdep.cpu | grep 'machdep.cpu.core_count' | cut -d: -f2 | cut 
-d\@ -f2 | awk '{print "Total cores:" $0}'
+echo "Disk information:"
+df -h | grep "^/" 
+;;
+  bsd*)
+echo "BSD is not currently supported, unable to detect system 
resources"
+;;
--- End diff --

Do we have a list of what OSs are supported to run Metron on?  I thought 
BSD, Windows, and Solaris actually weren't supported on the Metron side.  I 
know there is [a list for 
HDP](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_release-notes/content/ch01s02s01.html),
 but I don't feel like that applies.




[GitHub] incubator-metron pull request #400: METRON-636: Capture memory and cpu detai...

2017-01-12 Thread nickwallen
Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/400#discussion_r95808285
  
--- Diff: metron-deployment/scripts/platform-info.sh ---
@@ -62,3 +62,39 @@ mvn --version
 # operating system
 echo "--"
 uname -a
+
+# system resources
+echo "--"
+case "${OSTYPE}" in
+  linux*)
+cat /proc/meminfo  | grep -i MemTotal | awk '{print "Total System 
Memory = " $2/1024 " MB"}'
+cat /proc/cpuinfo | egrep 'model\ name' | uniq | cut -d: -f2 | awk 
'{print "Processor Model:" $0}'
+cat /proc/cpuinfo | egrep 'cpu\ MHz' | uniq | cut -d: -f2 | awk 
'{print "Processor Speed:" $0 " MHz"}'
+cat /proc/cpuinfo | grep -i '^processor' | wc -l | awk '{print "Total 
Physical Processors: " $0}'
+cat /proc/cpuinfo | grep -i cores | cut -d: -f2 | awk '{corecount+=$1} 
END {print "Total cores: " corecount}'
+echo "Disk information:"
+df -h | grep "^/" 
+;;
+  darwin*)
+sysctl hw.memsize | awk '{print "Total System Memory = " $2/1048576 " 
MB"}'
+sysctl machdep.cpu | grep 'machdep.cpu.brand_string' | cut -d: -f2 | 
cut -d\@ -f1 | awk '{print "Processor Model:" $0}'
+sysctl machdep.cpu | grep 'machdep.cpu.brand_string' | cut -d: -f2 | 
cut -d\@ -f2 | awk '{print "Processor Speed:" $0}'
+sysctl hw.physicalcpu | cut -d: -f2 | awk '{print "Total Physical 
Processors:" $0}'
+sysctl machdep.cpu | grep 'machdep.cpu.core_count' | cut -d: -f2 | cut 
-d\@ -f2 | awk '{print "Total cores:" $0}'
+echo "Disk information:"
+df -h | grep "^/" 
+;;
+  bsd*)
+echo "BSD is not currently supported, unable to detect system 
resources"
+;;
--- End diff --

@anandsubbu This message might be a little misleading.  Could a user 
misinterpret this as meaning that Metron does not support the given OS, versus 
just that the platform-info.sh script does not support it?

You could just remove the conditionals for BSD, Windows and Solaris and let 
it fall through to something like "Unable to detect system resources for 
${OSTYPE}".





[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
@dlyle65535 It definitely is a regression.  I didn't make it clear enough 
(I will do so now), but I very much do not want this PR to be committed before 
the management pack PR is committed.

The consequence of not managing it in Ambari is that you won't be able to 
adjust the default configs; this does not stop data from flowing through, but 
the defaults are sub-optimal (the index name defaults to the sensor name and 
the batch size defaults to 1).




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread ottobackwards
Github user ottobackwards commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
Just a side note: with the RPMs, Docker, and everything else, it is more 
difficult than ever to have a handle on all the places you need to consider for 
changes. I don't know if you want to discuss it on the list, but there should be 
a dev guide entry or something.




[GitHub] incubator-metron issue #413: METRON-654 Create RPM Installer for Profiler

2017-01-12 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/incubator-metron/pull/413
  
Thanks for the work in validating, @kylerichardson and @mattf-horton.  I 
know it is time-consuming to test this one.




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread dlyle65535
Github user dlyle65535 commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
Thanks @cestella, makes total sense.




[GitHub] incubator-metron issue #415: METRON-652: Extract indexing config from enrich...

2017-01-12 Thread dlyle65535
Github user dlyle65535 commented on the issue:

https://github.com/apache/incubator-metron/pull/415
  
Hi @cestella,

Can you give me a little detail about the consequences of not exposing the 
indexing config to Ambari? It seems like a regression to me; I recall we could 
deploy non-default configs prior to this PR. 

Thanks!





[GitHub] incubator-metron pull request #415: METRON-652: Extract indexing config from...

2017-01-12 Thread cestella
GitHub user cestella opened a pull request:

https://github.com/apache/incubator-metron/pull/415

METRON-652: Extract indexing config from enrichment config

Currently, the indexing configuration is bound and coupled with the sensor 
enrichment configuration.  This was done historically because indexing and 
enrichment were part of the same topology.  When the topologies were separated, 
the configurations were never separated, leaving a confusing section about 
indexes in the middle of the enrichment configuration.  This effort will 
separate out the configuration.  

Because of the configuration's simplicity, we are treating the config as 
just a `Map` and supporting the two existing configurations:
* `batchSize`
* `index`
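
For reference, a minimal per-sensor indexing config under this scheme looks 
something like the following (the values are illustrative; `squid` is just the 
example sensor used in the testing notes in the comments):
```
{
  "index" : "squid",
  "batchSize" : 5
}
```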

Because this change is fairly deep, a few non-obvious things were also 
updated for consistency and necessity:
* The Stellar management functions `ENRICHMENT_SET_BATCH` and 
`ENRICHMENT_SET_INDEX` have been renamed to `INDEXING_SET_BATCH` and 
`INDEXING_SET_INDEX`
* Support in `CONFIG_GET` and `CONFIG_PUT` for retrieving/storing indexing 
configurations was added.  
* Support for indexing configurations was added to the zookeeper utility 
function
* The indexing RPM was updated to include the new configs for the shipped 
sensors.

I have *not* updated the management pack to create a new panel for managing 
indexing configs.  That will be part of a follow-on JIRA (METRON-653).

I have done smoke testing on quickdev, but will include a more comprehensive 
testing explanation in the comments.  I feel that this has the possibility of 
causing regressions, so I want to be explicit about the testing methodology.

Furthermore, I plan on creating a wiki section about migrations, 
because this will necessitate migrating configs when upgrading from 0.3.0 to 
the next version.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cestella/incubator-metron METRON-652

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-metron/pull/415.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #415


commit 1d03bc57235af2c1b90c901e7895f3c037a054c2
Author: cstella 
Date:   2017-01-10T14:25:33Z

initial update

commit 4c9227cb618a62bfea63b147c32054bfeadf7d08
Author: cstella 
Date:   2017-01-10T15:48:33Z

Updating testing components.

commit 3681edb32dca2876ecb2c0895c58a392ce558ada
Author: cstella 
Date:   2017-01-10T20:32:47Z

second round of changes.

commit facb6518b27fd19fd5c3f40671b57b84013c4076
Author: cstella 
Date:   2017-01-11T15:03:14Z

Merge branch 'master' into METRON-652

commit ec91c279af078c94c4d762bd8ed166a18fa78da3
Author: cstella 
Date:   2017-01-11T15:03:40Z

Merge branch 'master' into METRON-652

commit ed98341e904575f4d1e89533440cb920ee7a6e3d
Author: cstella 
Date:   2017-01-11T17:54:59Z

updating.

commit 375e0260ce2d12e38e66e8ca05740bf6f71e348c
Author: cstella 
Date:   2017-01-11T21:10:16Z

Updating documentation.

commit 3736294bfe9bb4f3f6c3862af300da49960da81f
Author: cstella 
Date:   2017-01-11T21:15:24Z

Updating RPM.






[GitHub] incubator-metron issue #409: METRON-644 RPM builds only work with Docker for...

2017-01-12 Thread justinleet
Github user justinleet commented on the issue:

https://github.com/apache/incubator-metron/pull/409
  
I'm +1 by inspection, given that you've run through testing it on a couple 
platforms.




Re: [GitHub] incubator-metron issue #400: METRON-636: Capture memory and cpu details as a...

2017-01-12 Thread zeo...@gmail.com
+1 (non-binding)

On Thu, Jan 12, 2017, 6:12 AM anandsubbu  wrote:

> Github user anandsubbu commented on the issue:
>
> https://github.com/apache/incubator-metron/pull/400
>
> Thanks @mattf-horton .
>
> @nickwallen @JonZeolla is there any other thing you guys feel we
> should add to this script? Please let me know.
>
>
>
-- 

Jon

Sent from my mobile device


[GitHub] incubator-metron issue #400: METRON-636: Capture memory and cpu details as a...

2017-01-12 Thread anandsubbu
Github user anandsubbu commented on the issue:

https://github.com/apache/incubator-metron/pull/400
  
Thanks @mattf-horton .

@nickwallen @JonZeolla is there any other thing you guys feel we should add 
to this script? Please let me know.

