I don't quite support it for #1 and #2, but you absolutely sold me on #3. Good sell. +1
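For anyone skimming, here is a minimal sketch of reason #3 from the quoted message below (not executing a Stellar statement for writers that are off). It assumes a per-writer config dict shaped like the examples in this thread. `should_write` and `toy_evaluate` are hypothetical names, not Metron APIs; `toy_evaluate` is only a stand-in for a real Stellar evaluator and understands just `true`, `false` and `exists(field)`.

```python
import re

filter_calls = []  # records every filter expression that actually gets evaluated


def toy_evaluate(expression, message):
    """Stand-in for a Stellar evaluator (assumption, not the real thing)."""
    filter_calls.append(expression)
    if expression in ("true", "false"):
        return expression == "true"
    match = re.fullmatch(r"exists\((\w+)\)", expression)
    if match is None:
        raise ValueError("unsupported expression: " + expression)
    return match.group(1) in message


def should_write(writer_config, message, evaluate=toy_evaluate):
    """Decide whether one writer handles one message."""
    # 'enabled' defaults to on when unspecified, matching current semantics.
    enabled = str(writer_config.get("enabled", "true")).lower() == "true"
    if not enabled:
        # Disabled writers return early, so the 'when' filter is never run.
        return False
    when = writer_config.get("when")
    if when is None:
        return True
    return bool(evaluate(when, message))


hdfs = {"enabled": "false", "path": "/foo/bar", "when": "exists(field1)"}
es = {"enabled": "true", "index": "foo", "when": "exists(field1)"}
message = {"field1": "value"}

hdfs_result = should_write(hdfs, message)  # filter skipped entirely
es_result = should_write(es, message)      # filter evaluated once
```

The disabled HDFS entry keeps its `path` and `when` settings in place, which is the workflow benefit raised in reason #2.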
On Mon, Jan 16, 2017 at 10:46 AM, Casey Stella <ceste...@gmail.com> wrote:

> Well, I like it for a couple of reasons:
>
> - It's explicit and clear that the writer is on or off
> - It enables people to keep their writer config in the file without
>   having the writer on (so I don't have to adjust the when clause to
>   "false")
> - It enables us to not have to execute a Stellar statement for "off"
>   writers.
>
> On Mon, Jan 16, 2017 at 10:40 AM, Nick Allen <n...@nickallen.org> wrote:
>
> > I'm all for a compromise here. Sounds like we're getting close.
> >
> > Just one thing. Can you lay out the reasoning for having 'enabled' and
> > 'when'? I don't follow the reasoning, but maybe I am missing something.
> >
> > On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson
> > <kylerichards...@gmail.com> wrote:
> >
> > > I'm +1 on the current proposal. I like Nick's syntax and agree with
> > > Jon's enabled property. I also like the idea of a path property for HDFS.
> > >
> > > -Kyle
> > >
> > > > On Jan 14, 2017, at 10:51 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > >
> > > > I'm +1 on an explicit enabled property and a filter (or when) property. I
> > > > think we are zeroing in on a decent design, so that is good.
> > > >
> > > > To recap, what I am +1 on is Nick's proposed syntax with the following
> > > > modifications:
> > > > 1. An explicit enabled field
> > > > 2. A default of on when unspecified, to match current semantics
> > > >
> > > > Casey
> > > >
> > > >> On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com <zeo...@gmail.com> wrote:
> > > >>
> > > >> This has the additional benefit of doing something like below when you want
> > > >> to temporarily disable the hdfs writer, but don't want to remove the
> > > >> settings.
> > > >> This removes the need to store the path and batchSize (and many
> > > >> additional settings) somewhere else so they can be brought back in when you
> > > >> want to re-enable it, which is a nice workflow attribute for the end user:
> > > >>
> > > >> {
> > > >>   'elasticsearch': {
> > > >>     'enabled': 'true',
> > > >>     'index': 'foo',
> > > >>     'batchSize': 100
> > > >>   },
> > > >>   'hdfs': {
> > > >>     'enabled': 'false',
> > > >>     'path': '/foo/bar/...',
> > > >>     'batchSize': 100
> > > >>   },
> > > >>   'solr': {
> > > >>     'enabled': 'false'
> > > >>   }
> > > >> }
> > > >>
> > > >> Jon
> > > >>
> > > >>> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com <zeo...@gmail.com> wrote:
> > > >>>
> > > >>> I similarly have a concern there because I prefer being as explicit as
> > > >>> possible, which makes things easier to pick up for new users. Using my
> > > >>> example from earlier this could look like specifying when(false), but an
> > > >>> even better and more obvious approach may be to use enabled(false).
> > > >>> So the current simple default would be:
> > > >>>
> > > >>> {
> > > >>>   'elasticsearch': { 'enabled': 'true' },
> > > >>>   'hdfs': { 'enabled': 'true' },
> > > >>>   'solr': { 'enabled': 'false' }
> > > >>> }
> > > >>>
> > > >>> And to use ES with some overrides but not HDFS or solr it would look like:
> > > >>>
> > > >>> {
> > > >>>   'elasticsearch': {
> > > >>>     'enabled': 'true',
> > > >>>     'index': 'foo',
> > > >>>     'batchSize': 100
> > > >>>   },
> > > >>>   'hdfs': {
> > > >>>     'enabled': 'false'
> > > >>>   },
> > > >>>   'solr': {
> > > >>>     'enabled': 'false'
> > > >>>   }
> > > >>> }
> > > >>>
> > > >>> Jon
> > > >>>
> > > >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <ceste...@gmail.com> wrote:
> > > >>>
> > > >>> One thing that I thought of that I very strenuously do not like in Nick's
> > > >>> proposal is that if a writer config is not specified then it is turned off
> > > >>> (I think; if I misunderstood let me know). In the situation where we have a
> > > >>> new sensor, right now if there is no index config and no enrichment
> > > >>> config, it still passes through to the index using defaults. In this new
> > > >>> scheme it would not. This changes the default semantics for the system and
> > > >>> I think it changes it for the worse.
> > > >>>
> > > >>> I would strongly prefer an on-by-default indexing config as we have now.
> > > >>>
> > > >>>> On Fri, Jan 13, 2017 at 17:13 Casey Stella <ceste...@gmail.com> wrote:
> > > >>>>
> > > >>>> One thing that I really like about Nick's suggestion is that it allows
> > > >>>> writer-specific configs in a clear and simple way. It is more complex for
> > > >>>> the default case (all writers write to indices named the same thing with a
> > > >>>> fixed batch size), which I do not like, but maybe it's worth the compromise
> > > >>>> to make it less complex for the advanced case.
> > > >>>> > > > >>>> Thanks a lot for the suggestion, Nick, it's interesting; I'm > > > beginning > > > >>> to > > > >>>> lean your way. > > > >>>> > > > >>>> On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com < > zeo...@gmail.com > > > > > > >>>> wrote: > > > >>>> > > > >>>> I like the suggestions you made, Nick. The only thing I would add > > is > > > >>> that > > > >>>> it's also nice to see an explicit when(false), as people newer to > > the > > > >>>> platform may not know where to expect configs for the different > > > >> writers. > > > >>>> Being able to do it either way, which I think is already assumed > in > > > >> your > > > >>>> model, would make sense. I would just suggest that, if we support > > but > > > >>> are > > > >>>> disabling a writer, that the platform inserts a default > when(false) > > to > > > >> be > > > >>>> explicit. > > > >>>> > > > >>>> Jon > > > >>>> > > > >>>> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <ceste...@gmail.com > > > > > >>> wrote: > > > >>>> > > > >>>>> Let me noodle on this over the weekend. Your syntax is looking > > less > > > >>>>> onerous to me and I like the following statement from Otto: "In > the > > > >>> end, > > > >>>>> each write destination ‘type’ will need it’s own configuration. > > This > > > >>> is > > > >>>> an > > > >>>>> extension point." > > > >>>>> > > > >>>>> I may come around to your way of thinking. > > > >>>>> > > > >>>>> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler < > > > >> ottobackwa...@gmail.com > > > >>>> > > > >>>>> wrote: > > > >>>>> > > > >>>>>> In the end, each write destination ‘type’ will need it’s own > > > >>>>>> configuration. This is an extension point. > > > >>>>>> { > > > >>>>>> HDFS:{ > > > >>>>>> outputAdapters:[ > > > >>>>>> {name: avro, > > > >>>>>> settings:{ > > > >>>>>> avro stuff…. > > > >>>>>> when:{ > > > >>>>>> }, > > > >>>>>> { > > > >>>>>> name: sequence file, > > > >>>>>> ….. > > > >>>>>> > > > >>>>>> or some such. 
> > > >>>>>>
> > > >>>>>> On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org) wrote:
> > > >>>>>>
> > > >>>>>> I will add also that instead of global overrides, like index, we should use
> > > >>>>>> configuration key names that are more appropriate to the output.
> > > >>>>>>
> > > >>>>>> For example, does 'index' really make sense for HDFS? Or would 'path' be
> > > >>>>>> more appropriate?
> > > >>>>>>
> > > >>>>>> {
> > > >>>>>>   'elasticsearch': {
> > > >>>>>>     'index': 'foo',
> > > >>>>>>     'batchSize': 1
> > > >>>>>>   },
> > > >>>>>>   'hdfs': {
> > > >>>>>>     'path': '/foo/bar/...',
> > > >>>>>>     'batchSize': 100
> > > >>>>>>   }
> > > >>>>>> }
> > > >>>>>>
> > > >>>>>> Ok, I've said my piece. Thanks for the effort in summarizing all this,
> > > >>>>>> Casey.
> > > >>>>>>
> > > >>>>>> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <n...@nickallen.org> wrote:
> > > >>>>>>
> > > >>>>>>>> Nick's concerns about my suggestion were that it was overly complex and
> > > >>>>>>>> hard to grok and that we could dispense with backwards compatibility and
> > > >>>>>>>> make people do a bit more work on the default case for the benefits of a
> > > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your position)
> > > >>>>>>>
> > > >>>>>>> I will add that in my mind, the majority case would be a user
> > > >>>>>>> specifying the outputs, but not things like 'batchSize' or 'when'. I think
> > > >>>>>>> in the majority case, the user would accept whatever the default batch size
> > > >>>>>>> is.
> > > >>>>>>>
> > > >>>>>>> Here are alternative suggestions for all the examples that you provided
> > > >>>>>>> previously.
> > > >>>>>>> > > > >>>>>>> Base Case > > > >>>>>>> > > > >>>>>>> - The user must always specify the 'outputs' for clarity. > > > >>>>>>> - Uses default index name, batch size and when = true. > > > >>>>>>> > > > >>>>>>> { > > > >>>>>>> 'elasticsearch': {}, > > > >>>>>>> 'hdfs': {} > > > >>>>>>> } > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> < > > > >>>>>> https://gist.github.com/nickwallen/ > 489735b65cdb38aae6e45cec7633a0 > > > >>>>>> a1#writer-non-specific-case>Writer-non-specific > > > >>>>>> > > > >>>>>>> Case > > > >>>>>>> > > > >>>>>>> - There are no global overrides, as in Casey's proposal. > > > >>>>>>> - Easier to grok IMO. > > > >>>>>>> > > > >>>>>>> { > > > >>>>>>> 'elasticsearch': { > > > >>>>>>> 'index': 'foo', > > > >>>>>>> 'batchSize': 100 > > > >>>>>>> }, > > > >>>>>>> 'hdfs': { > > > >>>>>>> 'index': 'foo', > > > >>>>>>> 'batchSize': 100 > > > >>>>>>> } > > > >>>>>>> } > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> < > > > >>>>>> https://gist.github.com/nickwallen/ > 489735b65cdb38aae6e45cec7633a0 > > > >>>>>> a1#writer-specific-case-without-filters>Writer-specific > > > >>>>>> > > > >>>>>>> case without filters > > > >>>>>>> > > > >>>>>>> { > > > >>>>>>> 'elasticsearch': { > > > >>>>>>> 'index': 'foo', > > > >>>>>>> 'batchSize': 1 > > > >>>>>>> }, > > > >>>>>>> 'hdfs': { > > > >>>>>>> 'index': 'foo', > > > >>>>>>> 'batchSize': 100 > > > >>>>>>> } > > > >>>>>>> } > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> < > > > >>>>>> https://gist.github.com/nickwallen/ > 489735b65cdb38aae6e45cec7633a0 > > > >>>>>> a1#writer-specific-case-with-filters>Writer-specific > > > >>>>>> > > > >>>>>>> case with filters > > > >>>>>>> > > > >>>>>>> - Instead of having to say when=false, just don't configure > HDFS > > > >>>>>>> > > > >>>>>>> { > > > >>>>>>> 'elasticsearch': { > > > >>>>>>> 'index': 'foo', > > > >>>>>>> 'batchSize': 100, > > > >>>>>>> 'when': 'exists(field1)' > > > >>>>>>> } > > > >>>>>>> } > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > 
>>>>>>> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Dave,
> > > >>>>>>>> For the benefit of posterity and people who might not be as deeply
> > > >>>>>>>> entangled in the system as we have been, I'll recap things and hopefully
> > > >>>>>>>> answer your question in the process.
> > > >>>>>>>>
> > > >>>>>>>> Historically the index configuration is split between the enrichment
> > > >>>>>>>> configs and the global configs.
> > > >>>>>>>>
> > > >>>>>>>> - The global configs control settings that apply to all sensors.
> > > >>>>>>>>   Historically this has been stuff like index connection strings, etc.
> > > >>>>>>>> - The sensor-specific configs control things that vary by sensor.
> > > >>>>>>>>
> > > >>>>>>>> As of Metron-652 (in review currently), we moved the sensor-specific
> > > >>>>>>>> configs out of the enrichment configs. The proposal here is to increase the
> > > >>>>>>>> granularity of the sensor-specific files to make them support index
> > > >>>>>>>> writer-specific configs. Right now in the indexing topology, we have 2
> > > >>>>>>>> writers (fixed): ES/Solr and HDFS.
> > > >>>>>>>>
> > > >>>>>>>> The proposed configuration would allow you to either specify a blanket
> > > >>>>>>>> sensor-level config for the index name and batchSize and/or override at the
> > > >>>>>>>> writer level, thereby supporting a couple of use-cases:
> > > >>>>>>>>
> > > >>>>>>>> - Turning off certain index writers (e.g.
HDFS)
> > > >>>>>>>> - Filtering the messages written to certain index writers
> > > >>>>>>>>
> > > >>>>>>>> The two competing configs between Nick and me are as follows:
> > > >>>>>>>>
> > > >>>>>>>> - I want to make sure we keep the old sensor-specific defaults with
> > > >>>>>>>>   writer-specific overrides available
> > > >>>>>>>> - Nick thought we could simplify the permutations by making the indexing
> > > >>>>>>>>   config only the writer-level configs.
> > > >>>>>>>>
> > > >>>>>>>> My concerns about Nick's suggestion were that the default and majority
> > > >>>>>>>> case, specifying the index and the batchSize for all writers (the one we
> > > >>>>>>>> support now), would require more configuration.
> > > >>>>>>>>
> > > >>>>>>>> Nick's concerns about my suggestion were that it was overly complex and
> > > >>>>>>>> hard to grok and that we could dispense with backwards compatibility and
> > > >>>>>>>> make people do a bit more work on the default case for the benefits of a
> > > >>>>>>>> simpler advanced case. (Nick, make sure I don't misstate your position).
> > > >>>>>>>>
> > > >>>>>>>> Casey
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <dlyle65...@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Casey,
> > > >>>>>>>>>
> > > >>>>>>>>> Can you give me a level set of what your thinking is now? I think it's
> > > >>>>>>>>> global control of all index types + overrides on a per-type basis. Fwiw,
> > > >>>>>>>>> I'm totally for that, but I want to make sure I'm not imposing my
> > > >>>>>>>>> preconceived notions on your consensus-driven ones.
> > > >>>>>>>>>
> > > >>>>>>>>> -D....
> > > >>>>>>>>> > > > >>>>>>>>> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella < > > > >>>> ceste...@gmail.com> > > > >>>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> I am suggesting that, yes. The configs are essentially the > > > >>> same > > > >>>> as > > > >>>>>>>>> yours, > > > >>>>>>>>>> except there is an override specified at the top level. > > > >>> Without > > > >>>>>>>> that, in > > > >>>>>>>>>> order to specify both HDFS and ES have batch sizes of 100, > > > >> you > > > >>>>> have > > > >>>>>> to > > > >>>>>>>>>> explicitly configure each. It's less that I'm trying to have > > > >>>>>>>> backwards > > > >>>>>>>>>> compatibility and more that I'm trying to make the majority > > > >>> case > > > >>>>>> easy: > > > >>>>>>>>> both > > > >>>>>>>>>> writers write everything to a specified index name with a > > > >>>>> specified > > > >>>>>>>> batch > > > >>>>>>>>>> size (which is what we have now). Beyond that, I want to > > > >> allow > > > >>>> for > > > >>>>>>>>>> specifying an override for the config on a writer-by-writer > > > >>>> basis > > > >>>>>> for > > > >>>>>>>>> those > > > >>>>>>>>>> who need it. > > > >>>>>>>>>> > > > >>>>>>>>>> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen < > > > >>>> n...@nickallen.org> > > > >>>>>>>> wrote: > > > >>>>>>>>>> > > > >>>>>>>>>>> Are you saying we support all of these variants? I realize > > > >>> you > > > >>>>> are > > > >>>>>>>>>> trying > > > >>>>>>>>>>> to have some backwards compatibility, but this also makes > > > >> it > > > >>>>>> harder > > > >>>>>>>>> for a > > > >>>>>>>>>>> user to grok (for me at least). > > > >>>>>>>>>>> > > > >>>>>>>>>>> Personally I like my original example as there are fewer > > > >>>>>>>>> sub-structures, > > > >>>>>>>>>>> like 'writerConfig', which makes the whole thing simpler > > > >> and > > > >>>>>> easier > > > >>>>>>>> to > > > >>>>>>>>>>> grok. But maybe others will think your proposal is just as > > > >>>> easy > > > >>>>> to > > > >>>>>>>>> grok. 
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Ok, so here's what I'm thinking based on the discussion:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - Keeping the configs that we have now (batchSize and index) as defaults
> > > >>>>>>>>>>>>   for the unspecified writer-specific case
> > > >>>>>>>>>>>> - Adding the config Nick suggested
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Base Case*:
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - all writers write all messages
> > > >>>>>>>>>>>> - index named the same as the sensor for all writers
> > > >>>>>>>>>>>> - batchSize of 1 for all writers
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Writer-non-specific case*:
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>>   "index" : "foo",
> > > >>>>>>>>>>>>   "batchSize" : 100
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - All writers write all messages
> > > >>>>>>>>>>>> - index is named "foo", different from the sensor for all writers
> > > >>>>>>>>>>>> - batchSize is 100 for all writers
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Writer-specific case without filters*
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>>   "index" : "foo",
> > > >>>>>>>>>>>>   "batchSize" : 1,
> > > >>>>>>>>>>>>   "writerConfig" : {
> > > >>>>>>>>>>>>     "elasticsearch" : {
> > > >>>>>>>>>>>>       "batchSize" : 100
> > > >>>>>>>>>>>>     }
> > > >>>>>>>>>>>>   }
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - All writers write all messages
> > > >>>>>>>>>>>> - index is named "foo", different from the sensor for all writers
> > > >>>>>>>>>>>> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > >>>>>>>>>>>> - NOTE: I could override the index name too
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> *Writer-specific case with filters*
> > > >>>>>>>>>>>> {
> > > >>>>>>>>>>>>   "index" : "foo",
> > > >>>>>>>>>>>>   "batchSize" : 1,
> > > >>>>>>>>>>>>   "writerConfig" : {
> > > >>>>>>>>>>>>     "elasticsearch" : {
> > > >>>>>>>>>>>>       "batchSize" : 100,
> > > >>>>>>>>>>>>       "when" : "exists(field1)"
> > > >>>>>>>>>>>>     },
> > > >>>>>>>>>>>>     "hdfs" : {
> > > >>>>>>>>>>>>       "when" : "false"
> > > >>>>>>>>>>>>     }
> > > >>>>>>>>>>>>   }
> > > >>>>>>>>>>>> }
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> - ES writer writes messages which have field1, HDFS doesn't
> > > >>>>>>>>>>>> - index is named "foo", different from the sensor for all writers
> > > >>>>>>>>>>>> - batchSize is 100 for the elasticsearch writer
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thoughts?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <cd...@hortonworks.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> For larger installations you need to control what is indexed so you don’t
> > > >>>>>>>>>>>>> end up with a nasty Elasticsearch situation and so you can mine the data
> > > >>>>>>>>>>>>> later for reports and training ML models.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks
> > > >>>>>>>>>>>>> Carolyn
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On 1/13/17, 9:40 AM, "Casey Stella" <ceste...@gmail.com> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> OH that's a good idea!
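The four cases in Casey's 10:01 AM proposal above (sensor-level `index` and `batchSize` as defaults, with per-writer entries under `writerConfig` overriding them) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Metron's implementation: `effective_writer_config` is a hypothetical helper, and the sensor name `squid` is borrowed from Nick's example purely for demonstration.

```python
def effective_writer_config(sensor_name, sensor_config, writer_name):
    """Resolve the settings one writer would see for one sensor (hypothetical)."""
    # Base case from the proposal: index named after the sensor, batchSize of 1,
    # and every message written ('when' holds the filter expression as a string).
    resolved = {"index": sensor_name, "batchSize": 1, "when": "true"}

    # Sensor-level values apply to every writer...
    for key in ("index", "batchSize"):
        if key in sensor_config:
            resolved[key] = sensor_config[key]

    # ...unless a writer-specific entry under 'writerConfig' overrides them.
    overrides = sensor_config.get("writerConfig", {}).get(writer_name, {})
    resolved.update(overrides)
    return resolved


# The "writer-specific case with filters" example from the proposal.
with_filters = {
    "index": "foo",
    "batchSize": 1,
    "writerConfig": {
        "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
        "hdfs": {"when": "false"},
    },
}

es_config = effective_writer_config("squid", with_filters, "elasticsearch")
hdfs_config = effective_writer_config("squid", with_filters, "hdfs")
base_config = effective_writer_config("squid", {}, "hdfs")  # the empty base case
```

With this resolution order, the empty config keeps today's on-by-default behavior, and an override only has to mention the keys that differ.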
> > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen < > > > >>>>>>>> n...@nickallen.org> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> I like the "Index Filtering" option based on the > > > >>>>>> flexibility > > > >>>>>>>>> that > > > >>>>>>>>>> it > > > >>>>>>>>>>>>>>> provides. Should each output (HDFS, ES, etc) have > > > >> its > > > >>>> own > > > >>>>>>>>>>>> configuration > > > >>>>>>>>>>>>>>> settings? For example, aren't things like batching > > > >>>>> handled > > > >>>>>>>>>>> separately > > > >>>>>>>>>>>>> for > > > >>>>>>>>>>>>>>> HDFS versus Elasticsearch? > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Something along the lines of... > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> { > > > >>>>>>>>>>>>>>> "hdfs" : { > > > >>>>>>>>>>>>>>> "when": "exists(field1)", > > > >>>>>>>>>>>>>>> "batchSize": 100 > > > >>>>>>>>>>>>>>> }, > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> "elasticsearch" : { > > > >>>>>>>>>>>>>>> "when": "true", > > > >>>>>>>>>>>>>>> "batchSize": 1000, > > > >>>>>>>>>>>>>>> "index": "squid" > > > >>>>>>>>>>>>>>> } > > > >>>>>>>>>>>>>>> } > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella < > > > >>>>>>>>> ceste...@gmail.com > > > >>>>>>>>>>> > > > >>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> Yeah, I tend to like the first option too. Any > > > >>>>> opposition > > > >>>>>>>> to > > > >>>>>>>>>> that > > > >>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>> anyone? 
> > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> The points brought up are good ones and I think > > > >>> that > > > >>>> it > > > >>>>>>>> may be > > > >>>>>>>>>>>> worth a > > > >>>>>>>>>>>>>>>> broader discussion of the requirements of > > > >> indexing > > > >>>> in a > > > >>>>>>>>> separate > > > >>>>>>>>>>> dev > > > >>>>>>>>>>>>> list > > > >>>>>>>>>>>>>>>> thread. Maybe a list of desires with coherent > > > >>>> use-cases > > > >>>>>>>>>>> justifying > > > >>>>>>>>>>>>> them > > > >>>>>>>>>>>>>>> so > > > >>>>>>>>>>>>>>>> we can think about how this stuff should work and > > > >>>> where > > > >>>>>> the > > > >>>>>>>>>>> natural > > > >>>>>>>>>>>>>>>> extension points should be. Afterall, we need to > > > >>> toe > > > >>>>> the > > > >>>>>>>> line > > > >>>>>>>>>>>> between > > > >>>>>>>>>>>>>>>> engineering and overengineering for features > > > >> nobody > > > >>>>> will > > > >>>>>>>> want. > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> I'm not sure about the extensions to the standard > > > >>>>> fields. > > > >>>>>>>> I'm > > > >>>>>>>>>>> torn > > > >>>>>>>>>>>>>>> between > > > >>>>>>>>>>>>>>>> the notions that we should have no standard > > > >> fields > > > >>> vs > > > >>>>> we > > > >>>>>>>>> should > > > >>>>>>>>>>>> have a > > > >>>>>>>>>>>>>>>> boatload of standard fields (with most of them > > > >>>> empty). > > > >>>>> I > > > >>>>>>>>>> exchange > > > >>>>>>>>>>>>>>>> positions fairly regularly on that question. ;) > > > >> It > > > >>>> may > > > >>>>> be > > > >>>>>>>>>> worth a > > > >>>>>>>>>>>> dev > > > >>>>>>>>>>>>>>> list > > > >>>>>>>>>>>>>>>> discussion to lay out how you imagine an > > > >> extension > > > >>> of > > > >>>>>>>> standard > > > >>>>>>>>>>>> fields > > > >>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>> how it might look as implemented in Metron. 
> > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> Casey > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> Casey > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson > > > >> < > > > >>>>>>>>>>>>>>>> kylerichards...@gmail.com> > > > >>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> I'll second my preference for the first > > > >> option. I > > > >>>>> think > > > >>>>>>>> the > > > >>>>>>>>>>>> ability > > > >>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>> use > > > >>>>>>>>>>>>>>>>> Stellar filters to customize indexing would be > > > >> a > > > >>>> big > > > >>>>>> win. > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> I'm glad Matt brought up the point about data > > > >>> lake > > > >>>>> and > > > >>>>>>>> CEP. > > > >>>>>>>>> I > > > >>>>>>>>>>>> think > > > >>>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>> a really important use case that we need to > > > >>>> consider. > > > >>>>>>>> Take a > > > >>>>>>>>>>>> simple > > > >>>>>>>>>>>>>>>>> example... If I have data coming in from 3 > > > >>>> different > > > >>>>>>>>> firewall > > > >>>>>>>>>>>>> vendors > > > >>>>>>>>>>>>>>>> and 2 > > > >>>>>>>>>>>>>>>>> different web proxy/url filtering vendors and I > > > >>>> want > > > >>>>> to > > > >>>>>>>> be > > > >>>>>>>>>> able > > > >>>>>>>>>>> to > > > >>>>>>>>>>>>>>>> analyze > > > >>>>>>>>>>>>>>>>> that data set, I need the data to be indexed > > > >> all > > > >>>>>> together > > > >>>>>>>>>>> (likely > > > >>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>> HDFS) > > > >>>>>>>>>>>>>>>>> and to have a normalized schema such that IP > > > >>>> address, > > > >>>>>>>> URL, > > > >>>>>>>>> and > > > >>>>>>>>>>>> user > > > >>>>>>>>>>>>>>> name > > > >>>>>>>>>>>>>>>>> (to take a few) can be easily queried and > > > >>>>> aggregated. 
I > > > >>>>>>>> can > > > >>>>>>>>>> also > > > >>>>>>>>>>>>>>> envision > > > >>>>>>>>>>>>>>>>> scenarios where I would want to index data > > > >> based > > > >>> on > > > >>>>>>>>> attributes > > > >>>>>>>>>>>> other > > > >>>>>>>>>>>>>>> than > > > >>>>>>>>>>>>>>>>> sensor, business unit or subsidiary for > > > >> example. > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> I've been wanted to propose extending our 7 > > > >>>> standard > > > >>>>>>>> fields > > > >>>>>>>>> to > > > >>>>>>>>>>>>> include > > > >>>>>>>>>>>>>>>>> things like URL and user. Is there community > > > >>>>>>>>> interest/support > > > >>>>>>>>>>> for > > > >>>>>>>>>>>>>>> moving > > > >>>>>>>>>>>>>>>> in > > > >>>>>>>>>>>>>>>>> that direction? If so, I'll start a new thread. > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> Thanks! > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> -Kyle > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley < > > > >>>>>>>>> ma...@apache.org > > > >>>>>>>>>>> > > > >>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> Ah, I see. If overriding the default index > > > >> name > > > >>>>>> allows > > > >>>>>>>>>> using > > > >>>>>>>>>>>> the > > > >>>>>>>>>>>>>>> same > > > >>>>>>>>>>>>>>>>>> name for multiple sensors, then the goal can > > > >> be > > > >>>>>>>> achieved. > > > >>>>>>>>>>>>>>>>>> Thanks, > > > >>>>>>>>>>>>>>>>>> --Matt > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> On 1/12/17, 3:30 PM, "Casey Stella" < > > > >>>>>>>> ceste...@gmail.com> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> Oh, you could! Let's say you have a syslog > > > >>> parser > > > >>>>>>>>> with > > > >>>>>>>>>>> data > > > >>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>> sources 1 > > > >>>>>>>>>>>>>>>>>> 2 and 3. 
You'd end up with one kafka queue > > > >>> with 3 > > > >>>>>>>>>> parsers > > > >>>>>>>>>>>>>>> attached > > > >>>>>>>>>>>>>>>>> to > > > >>>>>>>>>>>>>>>>>> that > > > >>>>>>>>>>>>>>>>>> queue, each picking part the messages from > > > >>> source > > > >>>>>>>> 1, 2 > > > >>>>>>>>>> and > > > >>>>>>>>>>>> 3. > > > >>>>>>>>>>>>>>>> They'd > > > >>>>>>>>>>>>>>>>>> go > > > >>>>>>>>>>>>>>>>>> through separate enrichment and into the > > > >>> indexing > > > >>>>>>>>>>> topology. > > > >>>>>>>>>>>>> In > > > >>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>> indexing topology, you could specify the same > > > >>>> index > > > >>>>>>>>> name > > > >>>>>>>>>>>>> "syslog" > > > >>>>>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>> all > > > >>>>>>>>>>>>>>>>>> of the messages go into the same index for > > > >> CEP > > > >>>>>>>>> querying > > > >>>>>>>>>> if > > > >>>>>>>>>>>> so > > > >>>>>>>>>>>>>>>>> desired. > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley < > > > >>>>>>>>>>>> ma...@apache.org > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> Syslog is hell on parsers – I know, I > > > >> worked > > > >>> at > > > >>>>>>>>>> LogLogic > > > >>>>>>>>>>>> in > > > >>>>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>>> previous > > > >>>>>>>>>>>>>>>>>>> life. It makes perfect sense to route > > > >>> different > > > >>>>>>>>> lines > > > >>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>> syslog > > > >>>>>>>>>>>>>>>>>> through > > > >>>>>>>>>>>>>>>>>>> different appropriate parsers. But a lot of > > > >>>> what > > > >>>>>>>>> the > > > >>>>>>>>>>>>> parsers > > > >>>>>>>>>>>>>>> do > > > >>>>>>>>>>>>>>>> is > > > >>>>>>>>>>>>>>>>>>> identify consistent subsets of metadata and > > > >>>>>>>> annotate > > > >>>>>>>>>> it > > > >>>>>>>>>>> – > > > >>>>>>>>>>>>> eg, > > > >>>>>>>>>>>>>>>>>> src_ip_addr, > > > >>>>>>>>>>>>>>>>>>> event timestamps, etc. 
Once those metadata > > > >>> are > > > >>>>>>>>>>> annotated > > > >>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>>>>> available > > > >>>>>>>>>>>>>>>>>>> with common field names, why doesn’t it > > > >> make > > > >>>>>>>> sense > > > >>>>>>>>> to > > > >>>>>>>>>>>> index > > > >>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>> messages > > > >>>>>>>>>>>>>>>>>>> together, for CEP querying? I think Splunk > > > >>> has > > > >>>>>>>>>>>> illustrated > > > >>>>>>>>>>>>>>> this > > > >>>>>>>>>>>>>>>>>> model. > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> On 1/12/17, 3:00 PM, "Casey Stella" < > > > >>>>>>>>>> ceste...@gmail.com > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> yeah, I mean, honestly, I think the > > > >> approach > > > >>>>>>>>> that > > > >>>>>>>>>>>> we've > > > >>>>>>>>>>>>>>> taken > > > >>>>>>>>>>>>>>>>> for > > > >>>>>>>>>>>>>>>>>>> sources > > > >>>>>>>>>>>>>>>>>>> which aggregate different types of data is > > > >> to > > > >>>>>>>>>>> provide > > > >>>>>>>>>>>>>>> filters > > > >>>>>>>>>>>>>>>>> at > > > >>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>> parser > > > >>>>>>>>>>>>>>>>>>> level and have multiple parser topologies > > > >>>>>>>> (with > > > >>>>>>>>>>>>> different, > > > >>>>>>>>>>>>>>>>>> possibly > > > >>>>>>>>>>>>>>>>>>> mutually exclusive filters) running. This > > > >>>>>>>> would > > > >>>>>>>>>> be > > > >>>>>>>>>>> a > > > >>>>>>>>>>>>>>>>> completely > > > >>>>>>>>>>>>>>>>>>> separate > > > >>>>>>>>>>>>>>>>>>> sensor. Imagine a syslog data source that > > > >>>>>>>>>>> aggregates > > > >>>>>>>>>>>>> and > > > >>>>>>>>>>>>>>> you > > > >>>>>>>>>>>>>>>>>> want to > > > >>>>>>>>>>>>>>>>>>> pick > > > >>>>>>>>>>>>>>>>>>> apart certain pieces of messages. This is > > > >>>>>>>> why > > > >>>>>>>>> the > > > >>>>>>>>>>>>> initial > > > >>>>>>>>>>>>>>>>>> thought and > > > >>>>>>>>>>>>>>>>>>> architecture was one index per sensor. 
> > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>> On Thu, Jan 12, 2017 at 5:55 PM, Matt > > > >> Foley < > > > >>>>>>>>>>>>>>>> ma...@apache.org> > > > >>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> I’m thinking that CEP (Complex Event > > > >>>>>>>>> Processing) > > > >>>>>>>>>>> is > > > >>>>>>>>>>>>>>>> contrary > > > >>>>>>>>>>>>>>>>>> to the > > > >>>>>>>>>>>>>>>>>>> idea > > > >>>>>>>>>>>>>>>>>>>> of silo-ing data per sensor. > > > >>>>>>>>>>>>>>>>>>>> Now it’s true that some of those sensors > > > >>>>>>>> are > > > >>>>>>>>>>> already > > > >>>>>>>>>>>>>>>>>> aggregating > > > >>>>>>>>>>>>>>>>>>> data from > > > >>>>>>>>>>>>>>>>>>>> multiple sources, so maybe I’m wrong > > > >> here. > > > >>>>>>>>>>>>>>>>>>>> But it just seems to me that the “data > > > >>>>>>>> lake” > > > >>>>>>>>>>>> insights > > > >>>>>>>>>>>>>>> come > > > >>>>>>>>>>>>>>>>> from > > > >>>>>>>>>>>>>>>>>>> being able > > > >>>>>>>>>>>>>>>>>>>> to make decisions over the whole mass of > > > >>>>>>>> data > > > >>>>>>>>>>> rather > > > >>>>>>>>>>>>> than > > > >>>>>>>>>>>>>>>>> just > > > >>>>>>>>>>>>>>>>>>> vertical > > > >>>>>>>>>>>>>>>>>>>> slices of it. > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> On 1/12/17, 2:15 PM, "Casey Stella" < > > > >>>>>>>>>>>>> ceste...@gmail.com> > > > >>>>>>>>>>>>>>>>>> wrote: > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Hey Matt, > > > >>>>>>>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>>>>>>> Thanks for the comment! > > > >>>>>>>>>>>>>>>>>>>> 1. At the moment, we only have one > > > >>>>>>>> index > > > >>>>>>>>>> name, > > > >>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>> default > > > >>>>>>>>>>>>>>>>>> of > > > >>>>>>>>>>>>>>>>>>> which is > > > >>>>>>>>>>>>>>>>>>>> the > > > >>>>>>>>>>>>>>>>>>>> sensor name but that's entirely up to > > > >>>>>>>> the > > > >>>>>>>>>>> user. 
> This is sensor specific, so it'd be a separate config for each sensor.
> If we want to build multiple indices per sensor, we'd have to think
> carefully about how to do that, and it would be a bigger undertaking.
> I guess I can see the use, though (redirect messages to one index vs
> another based on a predicate for a given sensor).
> Anyway, not where I was originally thinking that this discussion would
> go, but it's an interesting point.
>
> 2. I hadn't thought through the implementation quite yet, but we don't
> actually have a splitter bolt in that topology, just a spout that goes
> to the Elasticsearch writer and also to the HDFS writer.
>
> On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <ma...@apache.org> wrote:
>
> Casey, good to have controls like this. Couple questions:
>
> 1. Regarding the "index" : "squid" name/value pair, is the index name
> expected to always be a sensor name? Or is the given JSON structure
> subordinate to a sensor name in ZooKeeper? Or can we build arbitrary
> indexes with this new specification, independent of sensor? Should
> there actually be a list of "indexes", i.e.
> { "indexes" : [
>     {"index" : "name1",
>      …
>     },
>     {"index" : "name2",
>      …
>     } ]
> }
>
> 2. Would the filtering / writer selection logic take place in the
> indexing topology splitter bolt?
> Seems like that would have the smallest impact on current
> implementation, no?
>
> Sorry if these are already answered in PR-415, I haven’t had time to
> review that one yet.
> Thanks,
> --Matt
>
> On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklav...@gmail.com>
> wrote:
>
> I like the flexibility and expressibility of the first option with
> Stellar filters.
>
> M
>
> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <ceste...@gmail.com>
> wrote:
>
> As of METRON-652
> <https://github.com/apache/incubator-metron/pull/415>, we will have
> decoupled the indexing configuration from the enrichment
> configuration.
> As an immediate follow-up to that, I'd like to provide the ability to
> turn off and on writers via the configs. I'd like to get some
> community feedback on how the functionality should work, if y'all are
> amenable. :)
>
> As of now, we have 3 possible writers which can be used in the
> indexing topology:
>
> - Solr
> - Elasticsearch
> - HDFS
>
> HDFS is always used; Elasticsearch or Solr is used depending on how
> you start the indexing topology.
>
> A couple of proposals come to mind immediately:
>
> *Index Filtering*
>
> You would be able to specify a filter as defined by a Stellar
> statement (likely a reuse of the StellarFilter that exists in the
> Parsers) which would allow you to indicate on a message-by-message
> basis whether or not to write the message.
>
> The semantics of this would be as follows:
>
> - Default (i.e. unspecified) is to pass everything through (hence
>   backwards compatible with the current default config).
> - Messages for which the associated Stellar statement evaluates to
>   true for the writer type will be written; all others will not.
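
For illustration only, the filter semantics listed above could be sketched roughly as follows. This is a minimal Python sketch, not Metron code: the function name `should_write` is hypothetical, and plain Python callables stand in for Stellar statements, which this sketch does not evaluate. It also assumes that "unspecified" covers both a missing `filters` section and a writer that has no entry in it.

```python
from typing import Callable, Dict, Optional

Message = Dict[str, object]
# Stand-in for a compiled Stellar statement (assumption: a plain callable).
Predicate = Callable[[Message], bool]


def should_write(writer: str,
                 message: Message,
                 filters: Optional[Dict[str, Predicate]] = None) -> bool:
    """Decide whether `writer` should write `message` (hypothetical helper)."""
    # Default: no filter given for this writer means pass everything through.
    if not filters or writer not in filters:
        return True
    # Otherwise the message is written only if the filter evaluates to true.
    return bool(filters[writer](message))


# Python stand-ins for the two statements "false" and "exists(field1)".
filters = {
    "HDFS": lambda msg: False,
    "ES": lambda msg: "field1" in msg,
}
```

Under these assumptions, a message carrying `field1` would be written by the ES writer and by any writer without a filter entry, but never by the HDFS writer.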
>
> Sample indexing config which would write out no messages to HDFS and
> write out only messages containing a field called "field1":
> {
>   "index" : "squid"
>   ,"batchSize" : 100
>   ,"filters" : {
>     "HDFS" : "false"
>     ,"ES" : "exists(field1)"
>   }
> }
>
> *Index On/Off Switch*
>
> A simpler solution would be to just provide a list of writers to write
> messages.
> The semantics would be as follows:
>
> - If the list is unspecified, then the default is to write all
>   messages for every writer in the indexing topology
> - If the list is specified, then a writer will write all messages if
>   and only if it is named in the list.
>
> Sample indexing config which turns off HDFS and keeps on
> Elasticsearch:
> {
>   "index" : "squid"
>   ,"batchSize" : 100
>   ,"writers" : [ "ES" ]
>
> --
>
> Jon
>
> Sent from my mobile device
>
> --
> Nick Allen <n...@nickallen.org>

--
Nick Allen <n...@nickallen.org>
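
For comparison, the on/off switch semantics quoted above (an unspecified list means every writer is on; a specified list enables only the writers it names) could be sketched as below. Again this is only an illustrative Python sketch under assumptions: `writer_enabled` is a hypothetical name, not part of Metron, and `None` is used to represent an absent `writers` entry.

```python
from typing import Iterable, Optional


def writer_enabled(writer: str,
                   writers: Optional[Iterable[str]] = None) -> bool:
    """Return True if `writer` should write messages (hypothetical helper)."""
    # List unspecified: every writer in the indexing topology writes.
    if writers is None:
        return True
    # List specified: a writer writes if and only if it is named in the list.
    return writer in set(writers)
```

One consequence of reading the semantics this way is that an explicitly empty list would turn every writer off, which is different from leaving the list out altogether.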