Re: [DISCUSS] Turning off indexing writers feature discussion

Nick Allen Mon, 16 Jan 2017 07:50:12 -0800

I don't quite support it for #1 and #2, but you absolutely sold me on #3.
Good sell.  +1
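
For anyone reading this thread in the archive, here is a minimal sketch of how
'enabled' and 'when' could interact under the design discussed below. It is an
illustration in Python, not Metron code: select_writers, toy_when and the list
of known writers are hypothetical names, and toy_when is only a stand-in for a
real Stellar evaluation. It assumes an unspecified writer defaults to on and
that a disabled writer never has its 'when' expression evaluated.

```python
import re


def toy_when(expression, message):
    """Hypothetical stand-in for a Stellar filter; understands only
    'true', 'false' and 'exists(field)'."""
    expression = expression.strip()
    if expression == "true":
        return True
    if expression == "false":
        return False
    match = re.fullmatch(r"exists\((\w+)\)", expression)
    if match:
        return match.group(1) in message
    raise ValueError("unsupported expression: " + expression)


def select_writers(message, sensor_config, known_writers, evaluate_when=toy_when):
    """Return the names of the writers that should receive the message."""
    selected = []
    for name in known_writers:
        # A writer missing from the config falls back to defaults (on).
        writer_config = sensor_config.get(name, {})
        enabled = str(writer_config.get("enabled", "true")).lower() == "true"
        if not enabled:
            # Disabled writers are skipped before 'when' is evaluated,
            # so the rest of their settings can stay in the file untouched.
            continue
        if evaluate_when(writer_config.get("when", "true"), message):
            selected.append(name)
    return selected


# Example config in the style used in this thread (values are illustrative).
example_config = {
    "elasticsearch": {"enabled": "true", "index": "foo", "batchSize": 100,
                      "when": "exists(field1)"},
    "hdfs": {"enabled": "false", "path": "/foo/bar/...", "batchSize": 100},
}

if __name__ == "__main__":
    print(select_writers({"field1": "x"}, example_config, ["elasticsearch", "hdfs"]))
```

Flipping 'enabled' back to 'true' for hdfs restores the writer with its path
and batchSize intact, which is the workflow Jon describes further down.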



On Mon, Jan 16, 2017 at 10:46 AM, Casey Stella <ceste...@gmail.com> wrote:

> Well, I like it for a couple of reasons:
>
>    - It's explicit and clear that the writer is on or off
>    - It enables people to keep their writer config in the file without
>    having the writer on (so I don't have to adjust the when clause to
>    "false")
>    - It enables us to not have to execute a Stellar statement for "off"
>    writers.
>
>
>
> On Mon, Jan 16, 2017 at 10:40 AM, Nick Allen <n...@nickallen.org> wrote:
>
> > I'm all for a compromise here.  Sounds like we're getting close.
> >
> > Just one thing.  Can you lay out the reasoning for having 'enabled' and
> > 'when'?  I don't follow the reasoning, but maybe I am missing something.
> >
> > On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson <
> > kylerichards...@gmail.com
> > > wrote:
> >
> > > I'm +1 on the current proposal. I like Nick's syntax and agree with
> Jon's
> > > enabled property. I also like the idea of a path property for HDFS.
> > >
> > > -Kyle
> > >
> > > > On Jan 14, 2017, at 10:51 AM, Casey Stella <ceste...@gmail.com>
> wrote:
> > > >
> > > > I'm +1 on an explicit enabled property and a filter (or when)
> > property. I
> > > > think we are zeroing in on a decent design, so that is good.
> > > >
> > > > To recap, what I am +1 on is Nick's proposed syntax with the
> following
> > > > modifications:
> > > > 1. An explicit enabled field
> > > > 2. A default on for unspecified to match current semantics
> > > >
> > > > Casey
> > > >> On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com <zeo...@gmail.com>
> > > wrote:
> > > >>
> > > >> This has the additional benefit of doing something like below when
> you
> > > want
> > > >> to temporarily disable the hdfs writer, but don't want to remove the
> > > >> settings.  This removes the need to store the path and batchSize
> (and
> > > many
> > > >> additional settings) somewhere else so they can be brought back in
> > when
> > > you
> > > >> want to re-enable it, which is a nice workflow attribute for the end
> > > user:
> > > >>
> > > >> {
> > > >>   'elasticsearch': {
> > > >>      'enabled': 'true',
> > > >>      'index': 'foo',
> > > >>      'batchSize': 100
> > > >>    },
> > > >>   'hdfs': {
> > > >>      'enabled': 'false',
> > > >>      'path': '/foo/bar/...',
> > > >>      'batchSize': 100
> > > >>    },
> > > >>   'solr': {
> > > >>      'enabled': 'false'
> > > >>    }
> > > >> }
> > > >>
> > > >> Jon
> > > >>
> > > >>> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com <zeo...@gmail.com
> >
> > > wrote:
> > > >>>
> > > >>> I similarly have a concern there because I prefer being as explicit
> > as
> > > >>> possible, which makes things easier to pick up for new users.
> Using
> > my
> > > >>> example from earlier this could look like specifying when(false),
> > but
> > > an
> > > >>> even better and more obvious approach may be to use enabled(false).
> > So
> > > >> the
> > > >>> current simple default would be:
> > > >>>
> > > >>> {
> > > >>>   'elasticsearch': { 'enabled': 'true' },
> > > >>>   'hdfs': { 'enabled': 'true' },
> > > >>>   'solr': { 'enabled': 'false' }
> > > >>> }
> > > >>>
> > > >>> And to use ES with some overrides but not HDFS or solr it would
> look
> > > >> like:
> > > >>>
> > > >>> {
> > > >>>   'elasticsearch': {
> > > >>>      'enabled': 'true',
> > > >>>      'index': 'foo',
> > > >>>      'batchSize': 100
> > > >>>    },
> > > >>>   'hdfs': {
> > > >>>      'enabled': 'false'
> > > >>>    },
> > > >>>   'solr': {
> > > >>>      'enabled': 'false'
> > > >>>    }
> > > >>> }
> > > >>>
> > > >>> Jon
> > > >>>
> > > >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <ceste...@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>> One thing that I thought of that I very strenuously do not like in
> > Nick's
> > > >>> proposal is that if a writer config is not specified then it is
> > turned
> > > >> off
> > > >>> (I think; if I misunderstood let me know). In the situation where
> we
> > > >> have a
> > > >>> new sensor, right now if there is no index config and no
> enrichment
> > > >>> config, it still passes through to the index using defaults. In
> this
> > > new
> > > >>> scheme it would not. This changes the default semantics for the
> > system
> > > >> and
> > > >>> I think it changes it for the worse.
> > > >>>
> > > >>> I would strongly prefer an on-by-default indexing config as we have
> > now.
> > > >>>> On Fri, Jan 13, 2017 at 17:13 Casey Stella <ceste...@gmail.com>
> > > wrote:
> > > >>>>
> > > >>>> One thing that I really like about Nick's suggestion is that it
> > allows
> > > >>>> writer-specific configs in a clear and simple way.  It is more
> > complex
> > > >>> for
> > > >>>> the default case (all writers write to indices named the same
> thing
> > > >> with
> > > >>> a
> > > >>>> fixed batch size), which I do not like, but maybe it's worth the
> > > >>> compromise
> > > >>>> to make it less complex for the advanced case.
> > > >>>>
> > > >>>> Thanks a lot for the suggestion, Nick, it's interesting;  I'm
> > > beginning
> > > >>> to
> > > >>>> lean your way.
> > > >>>>
> > > >>>> On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com <
> zeo...@gmail.com
> > >
> > > >>>> wrote:
> > > >>>>
> > > >>>> I like the suggestions you made, Nick.  The only thing I would add
> > is
> > > >>> that
> > > >>>> it's also nice to see an explicit when(false), as people newer to
> > the
> > > >>>> platform may not know where to expect configs for the different
> > > >> writers.
> > > >>>> Being able to do it either way, which I think is already assumed
> in
> > > >> your
> > > >>>> model, would make sense.  I would just suggest that, if we support
> > but
> > > >>> are
> > > >>>> disabling a writer, that the platform inserts a default
> when(false)
> > to
> > > >> be
> > > >>>> explicit.
> > > >>>>
> > > >>>> Jon
> > > >>>>
> > > >>>> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <ceste...@gmail.com
> >
> > > >>> wrote:
> > > >>>>
> > > >>>>> Let me noodle on this over the weekend.  Your syntax is looking
> > less
> > > >>>>> onerous to me and I like the following statement from Otto: "In
> the
> > > >>> end,
> > > >>>>> each write destination ‘type’ will need its own configuration.
> > This
> > > >>> is
> > > >>>> an
> > > >>>>> extension point."
> > > >>>>>
> > > >>>>> I may come around to your way of thinking.
> > > >>>>>
> > > >>>>> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
> > > >> ottobackwa...@gmail.com
> > > >>>>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> In the end, each write destination ‘type’ will need its own
> > > >>>>>> configuration.  This is an extension point.
> > > >>>>>> {
> > > >>>>>> HDFS:{
> > > >>>>>> outputAdapters:[
> > > >>>>>> {name: avro,
> > > >>>>>> settings:{
> > > >>>>>> avro stuff….
> > > >>>>>> when:{
> > > >>>>>> },
> > > >>>>>> {
> > > >>>>>> name: sequence file,
> > > >>>>>> …..
> > > >>>>>>
> > > >>>>>> or some such.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org
> )
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>> I will add also that instead of global overrides, like index, we
> > > >>> should
> > > >>>>> use
> > > >>>>>> configuration key names that are more appropriate to the output.
> > > >>>>>>
> > > >>>>>> For example, does 'index' really make sense for HDFS? Or would
> > > >> 'path'
> > > >>>> be
> > > >>>>>> more appropriate?
> > > >>>>>>
> > > >>>>>> {
> > > >>>>>> 'elasticsearch': {
> > > >>>>>> 'index': 'foo',
> > > >>>>>> 'batchSize': 1
> > > >>>>>> },
> > > >>>>>> 'hdfs': {
> > > >>>>>> 'path': '/foo/bar/...',
> > > >>>>>> 'batchSize': 100
> > > >>>>>> }
> > > >>>>>> }
> > > >>>>>>
> > > >>>>>> Ok, I've said my piece. Thanks for the effort in summarizing all
> > > >>> this,
> > > >>>>>> Casey.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <
> n...@nickallen.org>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>>> Nick's concerns about my suggestion were that it was overly
> > > >> complex
> > > >>>> and
> > > >>>>>>>> hard to grok and that we could dispense with backwards
> > > >>> compatibility
> > > >>>>> and
> > > >>>>>>>> make people do a bit more work on the default case for the
> > > >>> benefits
> > > >>>>> of a
> > > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your
> > > >>>>> position)
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> What I will add is that, in my mind, the majority case would be a
> user
> > > >>>>>>> specifying the outputs, but not things like 'batchSize' or
> > > >> 'when'.
> > > >>> I
> > > >>>>>> think
> > > >>>>>>> in the majority case, the user would accept whatever the
> default
> > > >>>> batch
> > > >>>>>> size
> > > >>>>>>> is.
> > > >>>>>>>
> > > >>>>>>> Here are alternative suggestions for all the examples that you
> > > >>>>> provided
> > > >>>>>>> previously.
> > > >>>>>>>
> > > >>>>>>> Base Case
> > > >>>>>>>
> > > >>>>>>> - The user must always specify the 'outputs' for clarity.
> > > >>>>>>> - Uses default index name, batch size and when = true.
> > > >>>>>>>
> > > >>>>>>> {
> > > >>>>>>> 'elasticsearch': {},
> > > >>>>>>> 'hdfs': {}
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Writer-non-specific Case
> > > >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> > > >>>>>>>
> > > >>>>>>> - There are no global overrides, as in Casey's proposal.
> > > >>>>>>> - Easier to grok IMO.
> > > >>>>>>>
> > > >>>>>>> {
> > > >>>>>>> 'elasticsearch': {
> > > >>>>>>> 'index': 'foo',
> > > >>>>>>> 'batchSize': 100
> > > >>>>>>> },
> > > >>>>>>> 'hdfs': {
> > > >>>>>>> 'index': 'foo',
> > > >>>>>>> 'batchSize': 100
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Writer-specific case without filters
> > > >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> > > >>>>>>>
> > > >>>>>>> {
> > > >>>>>>> 'elasticsearch': {
> > > >>>>>>> 'index': 'foo',
> > > >>>>>>> 'batchSize': 1
> > > >>>>>>> },
> > > >>>>>>> 'hdfs': {
> > > >>>>>>> 'index': 'foo',
> > > >>>>>>> 'batchSize': 100
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Writer-specific case with filters
> > > >>>>>>> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> > > >>>>>>>
> > > >>>>>>> - Instead of having to say when=false, just don't configure
> HDFS
> > > >>>>>>>
> > > >>>>>>> {
> > > >>>>>>> 'elasticsearch': {
> > > >>>>>>> 'index': 'foo',
> > > >>>>>>> 'batchSize': 100,
> > > >>>>>>> 'when': 'exists(field1)'
> > > >>>>>>> }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <
> > > >> ceste...@gmail.com
> > > >>>>
> > > >>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Dave,
> > > >>>>>>>> For the benefit of posterity and people who might not be as
> > > >> deeply
> > > >>>>>>>> entangled in the system as we have been, I'll recap things and
> > > >>>>> hopefully
> > > >>>>>>>> answer your question in the process.
> > > >>>>>>>>
> > > >>>>>>>> Historically the index configuration is split between the
> > > >>> enrichment
> > > >>>>>>>> configs and the global configs.
> > > >>>>>>>>
> > > >>>>>>>> - The global configs control settings that apply to all sensors.
> > > >>>>>>>> Historically this has been stuff like index connection strings, etc.
> > > >>>>>>>> - The sensor-specific configs control things that vary by sensor.
> > > >>>>>>>>
> > > >>>>>>>> As of Metron-652 (in review currently), we moved the sensor-specific
> > > >>>>>>>> configs out of the enrichment configs. The proposal here is to increase
> > > >>>>>>>> the granularity of the sensor-specific files to make them support index
> > > >>>>>>>> writer-specific configs. Right now in the indexing topology, we have 2
> > > >>>>>>>> writers (fixed): ES/Solr and HDFS.
> > > >>>>>>>>
> > > >>>>>>>> The proposed configuration would allow you to either specify a
> > > >>>> blanket
> > > >>>>>>>> sensor-level config for the index name and batchSize and/or
> > > >>> override
> > > >>>>> at
> > > >>>>>>>> the
> > > >>>>>>>> writer level, thereby supporting a couple of use-cases:
> > > >>>>>>>>
> > > >>>>>>>> - Turning off certain index writers (e.g. HDFS)
> > > >>>>>>>> - Filtering the messages written to certain index writers
> > > >>>>>>>>
> > > >>>>>>>> The two competing configs between Nick and me are as follows:
> > > >>>>>>>>
> > > >>>>>>>> - I want to make sure we keep the old sensor-specific defaults
> > > >>> with
> > > >>>>>>>> writer-specific overrides available
> > > >>>>>>>> - Nick thought we could simplify the permutations by making
> the
> > > >>>>>>>> indexing
> > > >>>>>>>> config only the writer-level configs.
> > > >>>>>>>>
> > > >>>>>>>> My concerns about Nick's suggestion were that the default and majority
> > > >>>>>>>> case, specifying the index and the batchSize for all writers (the one we
> > > >>>>>>>> support now), would require more configuration.
> > > >>>>>>>>
> > > >>>>>>>> Nick's concerns about my suggestion were that it was overly
> > > >>> complex
> > > >>>>> and
> > > >>>>>>>> hard to grok and that we could dispense with backwards
> > > >>> compatibility
> > > >>>>> and
> > > >>>>>>>> make people do a bit more work on the default case for the
> > > >>> benefits
> > > >>>>> of a
> > > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your
> > > >>>>> position).
> > > >>>>>>>>
> > > >>>>>>>> Casey
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <
> > > >>> dlyle65...@gmail.com>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Casey,
> > > >>>>>>>>>
> > > >>>>>>>>> Can you give me a level set of what your thinking is now? I
> > > >>> think
> > > >>>>> it's
> > > >>>>>>>>> global control of all index types + overrides on a per-type
> > > >>> basis.
> > > >>>>>> Fwiw,
> > > >>>>>>>>> I'm totally for that, but I want to make sure I'm not
> imposing
> > > >>> my
> > > >>>>>>>>> preconceived notions on your consensus-driven ones.
> > > >>>>>>>>>
> > > >>>>>>>>> -D....
> > > >>>>>>>>>
> > > >>>>>>>>> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <
> > > >>>> ceste...@gmail.com>
> > > >>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> I am suggesting that, yes. The configs are essentially the
> > > >>> same
> > > >>>> as
> > > >>>>>>>>> yours,
> > > >>>>>>>>>> except there is an override specified at the top level.
> > > >>> Without
> > > >>>>>>>> that, in
> > > >>>>>>>>>> order to specify both HDFS and ES have batch sizes of 100,
> > > >> you
> > > >>>>> have
> > > >>>>>> to
> > > >>>>>>>>>> explicitly configure each. It's less that I'm trying to have
> > > >>>>>>>> backwards
> > > >>>>>>>>>> compatibility and more that I'm trying to make the majority
> > > >>> case
> > > >>>>>> easy:
> > > >>>>>>>>> both
> > > >>>>>>>>>> writers write everything to a specified index name with a
> > > >>>>> specified
> > > >>>>>>>> batch
> > > >>>>>>>>>> size (which is what we have now). Beyond that, I want to
> > > >> allow
> > > >>>> for
> > > >>>>>>>>>> specifying an override for the config on a writer-by-writer
> > > >>>> basis
> > > >>>>>> for
> > > >>>>>>>>> those
> > > >>>>>>>>>> who need it.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <
> > > >>>> n...@nickallen.org>
> > > >>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Are you saying we support all of these variants? I realize
> > > >>> you
> > > >>>>> are
> > > >>>>>>>>>> trying
> > > >>>>>>>>>>> to have some backwards compatibility, but this also makes
> > > >> it
> > > >>>>>> harder
> > > >>>>>>>>> for a
> > > >>>>>>>>>>> user to grok (for me at least).
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Personally I like my original example as there are fewer
> > > >>>>>>>>> sub-structures,
> > > >>>>>>>>>>> like 'writerConfig', which makes the whole thing simpler
> > > >> and
> > > >>>>>> easier
> > > >>>>>>>> to
> > > >>>>>>>>>>> grok. But maybe others will think your proposal is just as
> > > >>>> easy
> > > >>>>> to
> > > >>>>>>>>> grok.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <
> > > >>>>>> ceste...@gmail.com>
> > > >>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Ok, so here's what I'm thinking based on the discussion:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - Keeping the configs that we have now (batchSize and
> > > >>> index)
> > > >>>>> as
> > > >>>>>>>>>>> defaults
> > > >>>>>>>>>>>> for the unspecified writer-specific case
> > > >>>>>>>>>>>> - Adding the config Nick suggested
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Base Case*:
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - all writers write all messages
> > > >>>>>>>>>>>> - index named the same as the sensor for all writers
> > > >>>>>>>>>>>> - batchSize of 1 for all writers
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Writer-non-specific case*:
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>> "index" : "foo"
> > > >>>>>>>>>>>> ,"batchSize" : 100
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - All writers write all messages
> > > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > > >> all
> > > >>>>>>>> writers
> > > >>>>>>>>>>>> - batchSize is 100 for all writers
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Writer-specific case without filters*
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>> "index" : "foo"
> > > >>>>>>>>>>>> ,"batchSize" : 1
> > > >>>>>>>>>>>> , "writerConfig" :
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>> "elasticsearch" : {
> > > >>>>>>>>>>>> "batchSize" : 100
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - All writers write all messages
> > > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > > >> all
> > > >>>>>>>> writers
> > > >>>>>>>>>>>> - batchSize is 1 for HDFS and 100 for elasticsearch
> > > >>> writers
> > > >>>>>>>>>>>> - NOTE: I could override the index name too
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Writer-specific case with filters*
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>> "index" : "foo"
> > > >>>>>>>>>>>> ,"batchSize" : 1
> > > >>>>>>>>>>>> , "writerConfig" :
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>> "elasticsearch" : {
> > > >>>>>>>>>>>> "batchSize" : 100,
> > > >>>>>>>>>>>> "when" : "exists(field1)"
> > > >>>>>>>>>>>> },
> > > >>>>>>>>>>>> "hdfs" : {
> > > >>>>>>>>>>>> "when" : "false"
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - ES writer writes messages which have field1, HDFS
> > > >>> doesn't
> > > >>>>>>>>>>>> - index is named "foo", different from the sensor for
> > > >> all
> > > >>>>>>>> writers
> > > >>>>>>>>>>>> - batchSize is 100 for elasticsearch writers
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thoughts?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <
> > > >>>>>>>> cd...@hortonworks.com
> > > >>>>>>>>>>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> For larger installations you need to control what is
> > > >>>> indexed
> > > >>>>>> so
> > > >>>>>>>> you
> > > >>>>>>>>>>> don’t
> > > >>>>>>>>>>>>> end up with a nasty Elasticsearch situation and so
> > > >> you
> > > >>>> can
> > > >>>>>> mine
> > > >>>>>>>>> the
> > > >>>>>>>>>>> data
> > > >>>>>>>>>>>>> later for reports and training ML models.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks
> > > >>>>>>>>>>>>> Carolyn
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On 1/13/17, 9:40 AM, "Casey Stella" <
> > > >> ceste...@gmail.com
> > > >>>>
> > > >>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> OH that's a good idea!
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <
> > > >>>>>>>> n...@nickallen.org>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I like the "Index Filtering" option based on the
> > > >>>>>> flexibility
> > > >>>>>>>>> that
> > > >>>>>>>>>> it
> > > >>>>>>>>>>>>>>> provides. Should each output (HDFS, ES, etc) have
> > > >> its
> > > >>>> own
> > > >>>>>>>>>>>> configuration
> > > >>>>>>>>>>>>>>> settings? For example, aren't things like batching
> > > >>>>> handled
> > > >>>>>>>>>>> separately
> > > >>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>> HDFS versus Elasticsearch?
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Something along the lines of...
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> {
> > > >>>>>>>>>>>>>>> "hdfs" : {
> > > >>>>>>>>>>>>>>> "when": "exists(field1)",
> > > >>>>>>>>>>>>>>> "batchSize": 100
> > > >>>>>>>>>>>>>>> },
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> "elasticsearch" : {
> > > >>>>>>>>>>>>>>> "when": "true",
> > > >>>>>>>>>>>>>>> "batchSize": 1000,
> > > >>>>>>>>>>>>>>> "index": "squid"
> > > >>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <
> > > >>>>>>>>> ceste...@gmail.com
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Yeah, I tend to like the first option too. Any
> > > >>>>> opposition
> > > >>>>>>>> to
> > > >>>>>>>>>> that
> > > >>>>>>>>>>>>> from
> > > >>>>>>>>>>>>>>>> anyone?
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> The points brought up are good ones and I think
> > > >>> that
> > > >>>> it
> > > >>>>>>>> may be
> > > >>>>>>>>>>>> worth a
> > > >>>>>>>>>>>>>>>> broader discussion of the requirements of
> > > >> indexing
> > > >>>> in a
> > > >>>>>>>>> separate
> > > >>>>>>>>>>> dev
> > > >>>>>>>>>>>>> list
> > > >>>>>>>>>>>>>>>> thread. Maybe a list of desires with coherent
> > > >>>> use-cases
> > > >>>>>>>>>>> justifying
> > > >>>>>>>>>>>>> them
> > > >>>>>>>>>>>>>>> so
> > > >>>>>>>>>>>>>>>> we can think about how this stuff should work and
> > > >>>> where
> > > >>>>>> the
> > > >>>>>>>>>>> natural
> > > >>>>>>>>>>>>>>>> extension points should be. After all, we need to
> > > >>> toe
> > > >>>>> the
> > > >>>>>>>> line
> > > >>>>>>>>>>>> between
> > > >>>>>>>>>>>>>>>> engineering and overengineering for features
> > > >> nobody
> > > >>>>> will
> > > >>>>>>>> want.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> I'm not sure about the extensions to the standard
> > > >>>>> fields.
> > > >>>>>>>> I'm
> > > >>>>>>>>>>> torn
> > > >>>>>>>>>>>>>>> between
> > > >>>>>>>>>>>>>>>> the notions that we should have no standard
> > > >> fields
> > > >>> vs
> > > >>>>> we
> > > >>>>>>>>> should
> > > >>>>>>>>>>>> have a
> > > >>>>>>>>>>>>>>>> boatload of standard fields (with most of them
> > > >>>> empty).
> > > >>>>> I
> > > >>>>>>>>>> exchange
> > > >>>>>>>>>>>>>>>> positions fairly regularly on that question. ;)
> > > >> It
> > > >>>> may
> > > >>>>> be
> > > >>>>>>>>>> worth a
> > > >>>>>>>>>>>> dev
> > > >>>>>>>>>>>>>>> list
> > > >>>>>>>>>>>>>>>> discussion to lay out how you imagine an
> > > >> extension
> > > >>> of
> > > >>>>>>>> standard
> > > >>>>>>>>>>>> fields
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>> how it might look as implemented in Metron.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Casey
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson
> > > >> <
> > > >>>>>>>>>>>>>>>> kylerichards...@gmail.com>
> > > >>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I'll second my preference for the first
> > > >> option. I
> > > >>>>> think
> > > >>>>>>>> the
> > > >>>>>>>>>>>> ability
> > > >>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>> use
> > > >>>>>>>>>>>>>>>>> Stellar filters to customize indexing would be
> > > >> a
> > > >>>> big
> > > >>>>>> win.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I'm glad Matt brought up the point about data
> > > >>> lake
> > > >>>>> and
> > > >>>>>>>> CEP.
> > > >>>>>>>>> I
> > > >>>>>>>>>>>> think
> > > >>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>> a really important use case that we need to
> > > >>>> consider.
> > > >>>>>>>> Take a
> > > >>>>>>>>>>>> simple
> > > >>>>>>>>>>>>>>>>> example... If I have data coming in from 3
> > > >>>> different
> > > >>>>>>>>> firewall
> > > >>>>>>>>>>>>> vendors
> > > >>>>>>>>>>>>>>>> and 2
> > > >>>>>>>>>>>>>>>>> different web proxy/url filtering vendors and I
> > > >>>> want
> > > >>>>> to
> > > >>>>>>>> be
> > > >>>>>>>>>> able
> > > >>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>> analyze
> > > >>>>>>>>>>>>>>>>> that data set, I need the data to be indexed
> > > >> all
> > > >>>>>> together
> > > >>>>>>>>>>> (likely
> > > >>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>> HDFS)
> > > >>>>>>>>>>>>>>>>> and to have a normalized schema such that IP
> > > >>>> address,
> > > >>>>>>>> URL,
> > > >>>>>>>>> and
> > > >>>>>>>>>>>> user
> > > >>>>>>>>>>>>>>> name
> > > >>>>>>>>>>>>>>>>> (to take a few) can be easily queried and
> > > >>>>> aggregated. I
> > > >>>>>>>> can
> > > >>>>>>>>>> also
> > > >>>>>>>>>>>>>>> envision
> > > >>>>>>>>>>>>>>>>> scenarios where I would want to index data
> > > >> based
> > > >>> on
> > > >>>>>>>>> attributes
> > > >>>>>>>>>>>> other
> > > >>>>>>>>>>>>>>> than
> > > >>>>>>>>>>>>>>>>> sensor, business unit or subsidiary for
> > > >> example.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> I've been wanting to propose extending our 7
> > > >>>> standard
> > > >>>>>>>> fields
> > > >>>>>>>>> to
> > > >>>>>>>>>>>>> include
> > > >>>>>>>>>>>>>>>>> things like URL and user. Is there community
> > > >>>>>>>>> interest/support
> > > >>>>>>>>>>> for
> > > >>>>>>>>>>>>>>> moving
> > > >>>>>>>>>>>>>>>> in
> > > >>>>>>>>>>>>>>>>> that direction? If so, I'll start a new thread.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks!
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> -Kyle
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <
> > > >>>>>>>>> ma...@apache.org
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Ah, I see. If overriding the default index
> > > >> name
> > > >>>>>> allows
> > > >>>>>>>>>> using
> > > >>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> same
> > > >>>>>>>>>>>>>>>>>> name for multiple sensors, then the goal can
> > > >> be
> > > >>>>>>>> achieved.
> > > >>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>> --Matt
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On 1/12/17, 3:30 PM, "Casey Stella" <
> > > >>>>>>>> ceste...@gmail.com>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Oh, you could! Let's say you have a syslog
> > > >>> parser
> > > >>>>>>>>> with
> > > >>>>>>>>>>> data
> > > >>>>>>>>>>>>> from
> > > >>>>>>>>>>>>>>>>>> sources 1,
> > > >>>>>>>>>>>>>>>>>> 2 and 3. You'd end up with one Kafka queue
> > > >>> with 3
> > > >>>>>>>>>> parsers
> > > >>>>>>>>>>>>>>> attached
> > > >>>>>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>>>>>> that
> > > >>>>>>>>>>>>>>>>>> queue, each picking apart the messages from
> > > >>> source
> > > >>>>>>>> 1, 2
> > > >>>>>>>>>> and
> > > >>>>>>>>>>>> 3.
> > > >>>>>>>>>>>>>>>> They'd
> > > >>>>>>>>>>>>>>>>>> go
> > > >>>>>>>>>>>>>>>>>> through separate enrichment and into the
> > > >>> indexing
> > > >>>>>>>>>>> topology.
> > > >>>>>>>>>>>>> In
> > > >>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>> indexing topology, you could specify the same
> > > >>>> index
> > > >>>>>>>>> name
> > > >>>>>>>>>>>>> "syslog"
> > > >>>>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>> all
> > > >>>>>>>>>>>>>>>>>> of the messages go into the same index for
> > > >> CEP
> > > >>>>>>>>> querying
> > > >>>>>>>>>> if
> > > >>>>>>>>>>>> so
> > > >>>>>>>>>>>>>>>>> desired.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <
> > > >>>>>>>>>>>> ma...@apache.org
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Syslog is hell on parsers – I know, I
> > > >> worked
> > > >>> at
> > > >>>>>>>>>> LogLogic
> > > >>>>>>>>>>>> in
> > > >>>>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>>> previous
> > > >>>>>>>>>>>>>>>>>>> life. It makes perfect sense to route
> > > >>> different
> > > >>>>>>>>> lines
> > > >>>>>>>>>>>> from
> > > >>>>>>>>>>>>>>>> syslog
> > > >>>>>>>>>>>>>>>>>> through
> > > >>>>>>>>>>>>>>>>>>> different appropriate parsers. But a lot of
> > > >>>> what
> > > >>>>>>>>> the
> > > >>>>>>>>>>>>> parsers
> > > >>>>>>>>>>>>>>> do
> > > >>>>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>>>>>> identify consistent subsets of metadata and
> > > >>>>>>>> annotate
> > > >>>>>>>>>> it
> > > >>>>>>>>>>> –
> > > >>>>>>>>>>>>> eg,
> > > >>>>>>>>>>>>>>>>>> src_ip_addr,
> > > >>>>>>>>>>>>>>>>>>> event timestamps, etc. Once those metadata
> > > >>> are
> > > >>>>>>>>>>> annotated
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>>>> available
> > > >>>>>>>>>>>>>>>>>>> with common field names, why doesn’t it
> > > >> make
> > > >>>>>>>> sense
> > > >>>>>>>>> to
> > > >>>>>>>>>>>> index
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>> messages
> > > >>>>>>>>>>>>>>>>>>> together, for CEP querying? I think Splunk
> > > >>> has
> > > >>>>>>>>>>>> illustrated
> > > >>>>>>>>>>>>>>> this
> > > >>>>>>>>>>>>>>>>>> model.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On 1/12/17, 3:00 PM, "Casey Stella" <
> > > >>>>>>>>>> ceste...@gmail.com
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> yeah, I mean, honestly, I think the
> > > >> approach
> > > >>>>>>>>> that
> > > >>>>>>>>>>>> we've
> > > >>>>>>>>>>>>>>> taken
> > > >>>>>>>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>>>>>> sources
> > > >>>>>>>>>>>>>>>>>>> which aggregate different types of data is
> > > >> to
> > > >>>>>>>>>>> provide
> > > >>>>>>>>>>>>>>> filters
> > > >>>>>>>>>>>>>>>>> at
> > > >>>>>>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>>>>> parser
> > > >>>>>>>>>>>>>>>>>>> level and have multiple parser topologies
> > > >>>>>>>> (with
> > > >>>>>>>>>>>>> different,
> > > >>>>>>>>>>>>>>>>>> possibly
> > > >>>>>>>>>>>>>>>>>>> mutually exclusive filters) running. This
> > > >>>>>>>> would
> > > >>>>>>>>>> be
> > > >>>>>>>>>>> a
> > > >>>>>>>>>>>>>>>>> completely
> > > >>>>>>>>>>>>>>>>>>> separate
> > > >>>>>>>>>>>>>>>>>>> sensor. Imagine a syslog data source that
> > > >>>>>>>>>>> aggregates
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>> you
> > > >>>>>>>>>>>>>>>>>> want to
> > > >>>>>>>>>>>>>>>>>>> pick
> > > >>>>>>>>>>>>>>>>>>> apart certain pieces of messages. This is
> > > >>>>>>>> why
> > > >>>>>>>>> the
> > > >>>>>>>>>>>>> initial
> > > >>>>>>>>>>>>>>>>>> thought and
> > > >>>>>>>>>>>>>>>>>>> architecture was one index per sensor.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 5:55 PM, Matt
> > > >> Foley <
> > > >>>>>>>>>>>>>>>> ma...@apache.org>
> > > >>>>>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> I’m thinking that CEP (Complex Event Processing) is contrary
> > > >>>>>>>>>>>>>>>>>>>> to the idea of silo-ing data per sensor.
> > > >>>>>>>>>>>>>>>>>>>> Now it’s true that some of those sensors are already
> > > >>>>>>>>>>>>>>>>>>>> aggregating data from multiple sources, so maybe I’m wrong
> > > >>>>>>>>>>>>>>>>>>>> here.
> > > >>>>>>>>>>>>>>>>>>>> But it just seems to me that the “data lake” insights come
> > > >>>>>>>>>>>>>>>>>>>> from being able to make decisions over the whole mass of data
> > > >>>>>>>>>>>>>>>>>>>> rather than just vertical slices of it.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On 1/12/17, 2:15 PM, "Casey Stella" <ceste...@gmail.com> wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hey Matt,
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Thanks for the comment!
> > > >>>>>>>>>>>>>>>>>>>> 1. At the moment, we only have one index name, the default of
> > > >>>>>>>>>>>>>>>>>>>> which is the sensor name but that's entirely up to the user.
> > > >>>>>>>>>>>>>>>>>>>> This is sensor specific, so it'd be a separate config for
> > > >>>>>>>>>>>>>>>>>>>> each sensor. If we want to build multiple indices per sensor,
> > > >>>>>>>>>>>>>>>>>>>> we'd have to think carefully about how to do that and would
> > > >>>>>>>>>>>>>>>>>>>> be a bigger undertaking. I guess I can see the use, though
> > > >>>>>>>>>>>>>>>>>>>> (redirect messages to one index vs another based on a
> > > >>>>>>>>>>>>>>>>>>>> predicate for a given sensor).
> > > >>>>>>>>>>>>>>>>>>>> Anyway, not where I was originally thinking that this
> > > >>>>>>>>>>>>>>>>>>>> discussion would go, but it's an interesting point.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> 2. I hadn't thought through the implementation quite yet, but
> > > >>>>>>>>>>>>>>>>>>>> we don't actually have a splitter bolt in that topology, just
> > > >>>>>>>>>>>>>>>>>>>> a spout that goes to the elasticsearch writer and also to the
> > > >>>>>>>>>>>>>>>>>>>> hdfs writer.
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <ma...@apache.org> wrote:
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Casey, good to have controls like this. Couple questions:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> 1. Regarding the “index” : “squid” name/value pair, is the
> > > >>>>>>>>>>>>>>>>>>>>> index name expected to always be a sensor name? Or is the
> > > >>>>>>>>>>>>>>>>>>>>> given json structure subordinate to a sensor name in
> > > >>>>>>>>>>>>>>>>>>>>> zookeeper? Or can we build arbitrary indexes with this new
> > > >>>>>>>>>>>>>>>>>>>>> specification, independent of sensor? Should there actually
> > > >>>>>>>>>>>>>>>>>>>>> be a list of “indexes”, ie
> > > >>>>>>>>>>>>>>>>>>>>> { “indexes” : [
> > > >>>>>>>>>>>>>>>>>>>>>     {“index” : “name1”,
> > > >>>>>>>>>>>>>>>>>>>>>       …
> > > >>>>>>>>>>>>>>>>>>>>>     },
> > > >>>>>>>>>>>>>>>>>>>>>     {“index” : “name2”,
> > > >>>>>>>>>>>>>>>>>>>>>       …
> > > >>>>>>>>>>>>>>>>>>>>>     } ]
> > > >>>>>>>>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> 2. Would the filtering / writer selection logic take place
> > > >>>>>>>>>>>>>>>>>>>>> in the indexing topology splitter bolt? Seems like that
> > > >>>>>>>>>>>>>>>>>>>>> would have the smallest impact on current implementation,
> > > >>>>>>>>>>>>>>>>>>>>> no?
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Sorry if these are already answered in PR-415, I haven’t had
> > > >>>>>>>>>>>>>>>>>>>>> time to review that one yet.
> > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>> --Matt
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklav...@gmail.com> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> I like the flexibility and expressibility of the first
> > > >>>>>>>>>>>>>>>>>>>>> option with Stellar filters.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> M
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <ceste...@gmail.com> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> As of METRON-652
> > > >>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/incubator-metron/pull/415>, we
> > > >>>>>>>>>>>>>>>>>>>>>> will have decoupled the indexing configuration from the
> > > >>>>>>>>>>>>>>>>>>>>>> enrichment configuration. As an immediate follow-up to
> > > >>>>>>>>>>>>>>>>>>>>>> that, I'd like to provide the ability to turn off and on
> > > >>>>>>>>>>>>>>>>>>>>>> writers via the configs. I'd like to get some community
> > > >>>>>>>>>>>>>>>>>>>>>> feedback on how the functionality should work, if y'all are
> > > >>>>>>>>>>>>>>>>>>>>>> amenable. :)
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> As of now, we have 3 possible writers which can be used in
> > > >>>>>>>>>>>>>>>>>>>>>> the indexing topology:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> - Solr
> > > >>>>>>>>>>>>>>>>>>>>>> - Elasticsearch
> > > >>>>>>>>>>>>>>>>>>>>>> - HDFS
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> HDFS is always used, elasticsearch or solr is used
> > > >>>>>>>>>>>>>>>>>>>>>> depending on how you start the indexing topology.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> A couple of proposals come to mind immediately:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> *Index Filtering*
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> You would be able to specify a filter as defined by a
> > > >>>>>>>>>>>>>>>>>>>>>> stellar statement (likely a reuse of the StellarFilter that
> > > >>>>>>>>>>>>>>>>>>>>>> exists in the Parsers) which would allow you to indicate on
> > > >>>>>>>>>>>>>>>>>>>>>> a message-by-message basis whether or not to write the
> > > >>>>>>>>>>>>>>>>>>>>>> message.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> The semantics of this would be as follows:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> - Default (i.e. unspecified) is to pass everything through
> > > >>>>>>>>>>>>>>>>>>>>>>   (hence backwards compatible with the current default
> > > >>>>>>>>>>>>>>>>>>>>>>   config).
> > > >>>>>>>>>>>>>>>>>>>>>> - Messages which have the associated stellar statement
> > > >>>>>>>>>>>>>>>>>>>>>>   evaluate to true for the writer type will be written,
> > > >>>>>>>>>>>>>>>>>>>>>>   otherwise not.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which would write out no messages to
> > > >>>>>>>>>>>>>>>>>>>>>> HDFS and write out only messages containing a field called
> > > >>>>>>>>>>>>>>>>>>>>>> "field1":
> > > >>>>>>>>>>>>>>>>>>>>>> {
> > > >>>>>>>>>>>>>>>>>>>>>>   "index" : "squid"
> > > >>>>>>>>>>>>>>>>>>>>>>  ,"batchSize" : 100
> > > >>>>>>>>>>>>>>>>>>>>>>  ,"filters" : {
> > > >>>>>>>>>>>>>>>>>>>>>>     "HDFS" : "false"
> > > >>>>>>>>>>>>>>>>>>>>>>    ,"ES" : "exists(field1)"
> > > >>>>>>>>>>>>>>>>>>>>>>   }
> > > >>>>>>>>>>>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> *Index On/Off Switch*
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> A simpler solution would be to just provide a list of
> > > >>>>>>>>>>>>>>>>>>>>>> writers to write messages. The semantics would be as
> > > >>>>>>>>>>>>>>>>>>>>>> follows:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> - If the list is unspecified, then the default is to write
> > > >>>>>>>>>>>>>>>>>>>>>>   all messages for every writer in the indexing topology
> > > >>>>>>>>>>>>>>>>>>>>>> - If the list is specified, then a writer will write all
> > > >>>>>>>>>>>>>>>>>>>>>>   messages if and only if it is named in the list.
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Sample indexing config which turns off HDFS and keeps on
> > > >>>>>>>>>>>>>>>>>>>>>> Elasticsearch:
> > > >>>>>>>>>>>>>>>>>>>>>> {
> > > >>>>>>>>>>>>>>>>>>>>>>   "index" : "squid"
> > > >>>>>>>>>>>>>>>>>>>>>>  ,"batchSize" : 100
> > > >>>>>>>>>>>>>>>>>>>>>>  ,"writers" : [ "ES" ]
> > > >>>
> > > >>> --
> > > >>
> > > >> Jon
> > > >>
> > > >> Sent from my mobile device
> > > >>
> > >
> > >
> >
> >
> > --
> > Nick Allen <n...@nickallen.org>
> >
>
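For anyone skimming the thread, here is a rough Python sketch of the two
semantics from the original proposal quoted above (per-writer filters vs. a
plain list of writers). It is only an illustration, not Metron code: the
function names and the PREDICATES table are made up for this example, and the
Stellar statements are stood in for by a lookup table rather than a real
Stellar evaluation. It also assumes that a writer with no entry under
"filters" falls back to the pass-everything default.

```python
from typing import Any, Callable, Dict

Message = Dict[str, Any]
Predicate = Callable[[Message], bool]

# Hypothetical stand-ins for Stellar statements. A real implementation
# would evaluate the Stellar expression string instead of looking it up.
PREDICATES: Dict[str, Predicate] = {
    "true": lambda msg: True,
    "false": lambda msg: False,
    # Assumption: exists(field1) is modelled as "the key is present".
    "exists(field1)": lambda msg: "field1" in msg,
}


def should_write_filtered(config: Dict[str, Any], writer: str, msg: Message) -> bool:
    """Index Filtering: write only if the writer's statement evaluates to true.

    No "filters" section, or no entry for this writer, means pass everything
    through (the backwards-compatible default).
    """
    statement = config.get("filters", {}).get(writer)
    if statement is None:
        return True
    return PREDICATES[statement](msg)


def should_write_listed(config: Dict[str, Any], writer: str) -> bool:
    """Index On/Off Switch: write if and only if the writer is in the list.

    No "writers" list means every writer in the topology writes everything.
    """
    writers = config.get("writers")
    if writers is None:
        return True
    return writer in writers


filter_config = {
    "index": "squid",
    "batchSize": 100,
    "filters": {"HDFS": "false", "ES": "exists(field1)"},
}
switch_config = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
```

With filter_config, nothing goes to HDFS and only messages carrying field1 go
to ES. With switch_config, ES writes everything and HDFS writes nothing. The
filter form can express the switch form (a statement of "false" is "off"), but
not the other way around.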



-- 
Nick Allen <n...@nickallen.org>
