One thing that I really like about Nick's suggestion is that it allows writer-specific configs in a clear and simple way. It is more complex for the default case (all writers write to indices named the same thing with a fixed batch size), which I do not like, but maybe it's worth the compromise to make it less complex for the advanced case.
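To make the trade-off concrete, here is a minimal sketch in Python of how the two shapes from this thread could resolve to effective per-writer settings. It is purely illustrative and is not Metron code: the function names, the fixed writer list, and the fallback defaults (index = sensor name, batchSize = 1, when = "true") are assumptions taken from the examples quoted below, and the inserted "when": "false" for an unlisted writer follows Jon's suggestion.

```python
# Illustrative only: resolves effective per-writer settings for the two
# config shapes discussed in this thread. Not Metron's implementation.

WRITERS = ("elasticsearch", "hdfs")  # assumed fixed set of writers


def resolve_with_overrides(sensor, config):
    """Top-level defaults plus optional per-writer 'writerConfig' overrides."""
    defaults = {
        "index": config.get("index", sensor),    # assumed default: sensor name
        "batchSize": config.get("batchSize", 1),  # assumed default: 1
        "when": "true",                           # assumed default: write everything
    }
    overrides = config.get("writerConfig", {})
    return {w: {**defaults, **overrides.get(w, {})} for w in WRITERS}


def resolve_writer_only(sensor, config):
    """Flat per-writer layout: a writer that is not listed is disabled."""
    defaults = {"index": sensor, "batchSize": 1, "when": "true"}
    return {
        w: ({**defaults, **config[w]} if w in config
            else {**defaults, "when": "false"})  # explicit when=false when absent
        for w in WRITERS
    }
```

With the override style, "same index and batch size everywhere" is two top-level keys; with the writer-only style the same values are repeated per writer, but each writer's block is self-contained.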
Thanks a lot for the suggestion, Nick, it's interesting; I'm beginning to lean your way.

On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

> I like the suggestions you made, Nick. The only thing I would add is that
> it's also nice to see an explicit when(false), as people newer to the
> platform may not know where to expect configs for the different writers.
> Being able to do it either way, which I think is already assumed in your
> model, would make sense. I would just suggest that, if we support but are
> disabling a writer, that the platform inserts a default when(false) to be
> explicit.
>
> Jon
>
> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella <ceste...@gmail.com> wrote:
>
> > Let me noodle on this over the weekend. Your syntax is looking less
> > onerous to me and I like the following statement from Otto: "In the end,
> > each write destination ‘type’ will need its own configuration. This is an
> > extension point."
> >
> > I may come around to your way of thinking.
> >
> > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <ottobackwa...@gmail.com>
> > wrote:
> >
> > > In the end, each write destination ‘type’ will need its own
> > > configuration. This is an extension point.
> > > {
> > >   HDFS:{
> > >     outputAdapters:[
> > >       {name: avro,
> > >        settings:{
> > >          avro stuff….
> > >        when:{
> > >       },
> > >       {
> > >        name: sequence file,
> > >        …..
> > >
> > > or some such.
> > >
> > >
> > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org) wrote:
> > >
> > > I will add also that instead of global overrides, like index, we should
> > > use configuration key names that are more appropriate to the output.
> > >
> > > For example, does 'index' really make sense for HDFS? Or would 'path' be
> > > more appropriate?
> > >
> > > {
> > >   'elasticsearch': {
> > >     'index': 'foo',
> > >     'batchSize': 1
> > >   },
> > >   'hdfs': {
> > >     'path': '/foo/bar/...',
> > >     'batchSize': 100
> > >   }
> > > }
> > >
> > > Ok, I've said my piece.
> > > Thanks for the effort in summarizing all this, Casey.
> > >
> > >
> > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen <n...@nickallen.org> wrote:
> > >
> > > >> Nick's concerns about my suggestion were that it was overly complex and
> > > >> hard to grok and that we could dispense with backwards compatibility and
> > > >> make people do a bit more work on the default case for the benefits of a
> > > >> simpler advanced case. (Nick, make sure I don't misstate your position)
> > > >
> > > > I will add that in my mind, the majority case would be a user
> > > > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > > > think in the majority case, the user would accept whatever the default
> > > > batch size is.
> > > >
> > > > Here are alternative suggestions for all the examples that you provided
> > > > previously.
> > > >
> > > > Base Case
> > > >
> > > > - The user must always specify the 'outputs' for clarity.
> > > > - Uses default index name, batch size and when = true.
> > > >
> > > > {
> > > >   'elasticsearch': {},
> > > >   'hdfs': {}
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>Writer-non-specific Case
> > > >
> > > > - There are no global overrides, as in Casey's proposal.
> > > > - Easier to grok IMO.
> > > >
> > > > {
> > > >   'elasticsearch': {
> > > >     'index': 'foo',
> > > >     'batchSize': 100
> > > >   },
> > > >   'hdfs': {
> > > >     'index': 'foo',
> > > >     'batchSize': 100
> > > >   }
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>Writer-specific case without filters
> > > >
> > > > {
> > > >   'elasticsearch': {
> > > >     'index': 'foo',
> > > >     'batchSize': 1
> > > >   },
> > > >   'hdfs': {
> > > >     'index': 'foo',
> > > >     'batchSize': 100
> > > >   }
> > > > }
> > > >
> > > >
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>Writer-specific case with filters
> > > >
> > > > - Instead of having to say when=false, just don't configure HDFS
> > > >
> > > > {
> > > >   'elasticsearch': {
> > > >     'index': 'foo',
> > > >     'batchSize': 100,
> > > >     'when': 'exists(field1)'
> > > >   }
> > > > }
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella <ceste...@gmail.com>
> > > > wrote:
> > > >
> > > >> Dave,
> > > >> For the benefit of posterity and people who might not be as deeply
> > > >> entangled in the system as we have been, I'll recap things and hopefully
> > > >> answer your question in the process.
> > > >>
> > > >> Historically the index configuration is split between the enrichment
> > > >> configs and the global configs.
> > > >>
> > > >> - The global configs really control configs that apply to all sensors.
> > > >>   Historically this has been stuff like index connection strings, etc.
> > > >> - The sensor-specific configs which control things that vary by sensor.
> > > >>
> > > >> As of Metron-652 (in review currently), we moved the sensor specific
> > > >> configs from the enrichment configs.
> > > >> The proposal here is to increase the granularity of the sensor
> > > >> specific files to make them support index writer-specific configs.
> > > >> Right now in the indexing topology, we have 2 writers (fixed): ES/Solr
> > > >> and HDFS.
> > > >>
> > > >> The proposed configuration would allow you to either specify a blanket
> > > >> sensor-level config for the index name and batchSize and/or override at
> > > >> the writer level, thereby supporting a couple of use-cases:
> > > >>
> > > >> - Turning off certain index writers (e.g. HDFS)
> > > >> - Filtering the messages written to certain index writers
> > > >>
> > > >> The two competing configs between Nick and me are as follows:
> > > >>
> > > >> - I want to make sure we keep the old sensor-specific defaults with
> > > >>   writer-specific overrides available
> > > >> - Nick thought we could simplify the permutations by making the
> > > >>   indexing config only the writer-level configs.
> > > >>
> > > >> My concerns about Nick's suggestion were that the default and majority
> > > >> case, specifying the index and the batchSize for all writers (the one we
> > > >> support now) would require more configuration.
> > > >>
> > > >> Nick's concerns about my suggestion were that it was overly complex and
> > > >> hard to grok and that we could dispense with backwards compatibility and
> > > >> make people do a bit more work on the default case for the benefits of a
> > > >> simpler advanced case. (Nick, make sure I don't misstate your position).
> > > >>
> > > >> Casey
> > > >>
> > > >>
> > > >> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle <dlyle65...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Casey,
> > > >> >
> > > >> > Can you give me a level set of what your thinking is now? I think it's
> > > >> > global control of all index types + overrides on a per-type basis.
> > > >> > Fwiw, I'm totally for that, but I want to make sure I'm not imposing
> > > >> > my preconceived notions on your consensus-driven ones.
> > > >> >
> > > >> > -D....
> > > >> >
> > > >> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella <ceste...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > I am suggesting that, yes. The configs are essentially the same as
> > > >> > > yours, except there is an override specified at the top level. Without
> > > >> > > that, in order to specify both HDFS and ES have batch sizes of 100, you
> > > >> > > have to explicitly configure each. It's less that I'm trying to have
> > > >> > > backwards compatibility and more that I'm trying to make the majority
> > > >> > > case easy: both writers write everything to a specified index name with
> > > >> > > a specified batch size (which is what we have now). Beyond that, I want
> > > >> > > to allow for specifying an override for the config on a writer-by-writer
> > > >> > > basis for those who need it.
> > > >> > >
> > > >> > > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen <n...@nickallen.org>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Are you saying we support all of these variants? I realize you are
> > > >> > > > trying to have some backwards compatibility, but this also makes it
> > > >> > > > harder for a user to grok (for me at least).
> > > >> > > >
> > > >> > > > Personally I like my original example as there are fewer
> > > >> > > > sub-structures, like 'writerConfig', which makes the whole thing
> > > >> > > > simpler and easier to grok. But maybe others will think your proposal
> > > >> > > > is just as easy to grok.
> > > >> > > >
> > > >> > > >
> > > >> > > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella <ceste...@gmail.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Ok, so here's what I'm thinking based on the discussion:
> > > >> > > > >
> > > >> > > > > - Keeping the configs that we have now (batchSize and index) as
> > > >> > > > >   defaults for the unspecified writer-specific case
> > > >> > > > > - Adding the config Nick suggested
> > > >> > > > >
> > > >> > > > > *Base Case*:
> > > >> > > > > {
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - all writers write all messages
> > > >> > > > > - index named the same as the sensor for all writers
> > > >> > > > > - batchSize of 1 for all writers
> > > >> > > > >
> > > >> > > > > *Writer-non-specific case*:
> > > >> > > > > {
> > > >> > > > >   "index" : "foo"
> > > >> > > > >   ,"batchSize" : 100
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - All writers write all messages
> > > >> > > > > - index is named "foo", different from the sensor for all writers
> > > >> > > > > - batchSize is 100 for all writers
> > > >> > > > >
> > > >> > > > > *Writer-specific case without filters*
> > > >> > > > > {
> > > >> > > > >   "index" : "foo"
> > > >> > > > >   ,"batchSize" : 1
> > > >> > > > >   , "writerConfig" :
> > > >> > > > >   {
> > > >> > > > >     "elasticsearch" : {
> > > >> > > > >       "batchSize" : 100
> > > >> > > > >     }
> > > >> > > > >   }
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - All writers write all messages
> > > >> > > > > - index is named "foo", different from the sensor for all writers
> > > >> > > > > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > >> > > > > - NOTE: I could override the index name too
> > > >> > > > >
> > > >> > > > > *Writer-specific case with filters*
> > > >> > > > > {
> > > >> > > > >   "index" : "foo"
> > > >> > > > >   ,"batchSize" : 1
> > > >> > > > >   , "writerConfig" :
> > > >> > > > >   {
> > > >> > > > >     "elasticsearch" : {
> > > >> > > > >       "batchSize" : 100,
> > > >> > > > >       "when" : "exists(field1)"
> > > >> > > > >     },
> > > >> > > > >     "hdfs" : {
> > > >> > > > >       "when" : "false"
> > > >> > > > >     }
> > > >> > > > >   }
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > - ES writer writes messages which have field1, HDFS doesn't
> > > >> > > > > - index is named "foo", different from the sensor for all writers
> > > >> > > > > - batchSize is 100 for elasticsearch writers
> > > >> > > > >
> > > >> > > > > Thoughts?
> > > >> > > > >
> > > >> > > > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby <cd...@hortonworks.com>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > For larger installations you need to control what is indexed so you
> > > >> > > > > > don’t end up with a nasty elastic search situation and so you can
> > > >> > > > > > mine the data later for reports and training ml models.
> > > >> > > > > >
> > > >> > > > > > Thanks
> > > >> > > > > > Carolyn
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On 1/13/17, 9:40 AM, "Casey Stella" <ceste...@gmail.com> wrote:
> > > >> > > > > >
> > > >> > > > > > >OH that's a good idea!
> > > >> > > > > > >
> > > >> > > > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <n...@nickallen.org>
> > > >> > > > > > >wrote:
> > > >> > > > > > >
> > > >> > > > > > >> I like the "Index Filtering" option based on the flexibility that
> > > >> > > > > > >> it provides. Should each output (HDFS, ES, etc) have its own
> > > >> > > > > > >> configuration settings? For example, aren't things like batching
> > > >> > > > > > >> handled separately for HDFS versus Elasticsearch?
> > > >> > > > > > >>
> > > >> > > > > > >> Something along the lines of...
> > > >> > > > > > >>
> > > >> > > > > > >> {
> > > >> > > > > > >>   "hdfs" : {
> > > >> > > > > > >>     "when": "exists(field1)",
> > > >> > > > > > >>     "batchSize": 100
> > > >> > > > > > >>   },
> > > >> > > > > > >>
> > > >> > > > > > >>   "elasticsearch" : {
> > > >> > > > > > >>     "when": "true",
> > > >> > > > > > >>     "batchSize": 1000,
> > > >> > > > > > >>     "index": "squid"
> > > >> > > > > > >>   }
> > > >> > > > > > >> }
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <ceste...@gmail.com>
> > > >> > > > > > >> wrote:
> > > >> > > > > > >>
> > > >> > > > > > >> > Yeah, I tend to like the first option too. Any opposition to that
> > > >> > > > > > >> > from anyone?
> > > >> > > > > > >> >
> > > >> > > > > > >> > The points brought up are good ones and I think that it may be
> > > >> > > > > > >> > worth a broader discussion of the requirements of indexing in a
> > > >> > > > > > >> > separate dev list thread. Maybe a list of desires with coherent
> > > >> > > > > > >> > use-cases justifying them so we can think about how this stuff
> > > >> > > > > > >> > should work and where the natural extension points should be.
> > > >> > > > > > >> > After all, we need to toe the line between engineering and
> > > >> > > > > > >> > overengineering for features nobody will want.
> > > >> > > > > > >> >
> > > >> > > > > > >> > I'm not sure about the extensions to the standard fields.
> > > >> > > > > > >> > I'm torn between the notions that we should have no standard
> > > >> > > > > > >> > fields vs we should have a boatload of standard fields (with most
> > > >> > > > > > >> > of them empty). I exchange positions fairly regularly on that
> > > >> > > > > > >> > question. ;) It may be worth a dev list discussion to lay out how
> > > >> > > > > > >> > you imagine an extension of standard fields and how it might look
> > > >> > > > > > >> > as implemented in Metron.
> > > >> > > > > > >> >
> > > >> > > > > > >> > Casey
> > > >> > > > > > >> >
> > > >> > > > > > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > > >> > > > > > >> > kylerichards...@gmail.com>
> > > >> > > > > > >> > wrote:
> > > >> > > > > > >> >
> > > >> > > > > > >> > > I'll second my preference for the first option. I think the
> > > >> > > > > > >> > > ability to use Stellar filters to customize indexing would be a
> > > >> > > > > > >> > > big win.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> > > >> > > > > > >> > > think this is a really important use case that we need to
> > > >> > > > > > >> > > consider. Take a simple example...
> > > >> > > > > > >> > > If I have data coming in from 3 different firewall vendors and
> > > >> > > > > > >> > > 2 different web proxy/url filtering vendors and I want to be
> > > >> > > > > > >> > > able to analyze that data set, I need the data to be indexed
> > > >> > > > > > >> > > all together (likely in HDFS) and to have a normalized schema
> > > >> > > > > > >> > > such that IP address, URL, and user name (to take a few) can be
> > > >> > > > > > >> > > easily queried and aggregated. I can also envision scenarios
> > > >> > > > > > >> > > where I would want to index data based on attributes other than
> > > >> > > > > > >> > > sensor, business unit or subsidiary for example.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > I've been wanting to propose extending our 7 standard fields to
> > > >> > > > > > >> > > include things like URL and user. Is there community
> > > >> > > > > > >> > > interest/support for moving in that direction? If so, I'll
> > > >> > > > > > >> > > start a new thread.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > Thanks!
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > -Kyle
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <ma...@apache.org>
> > > >> > > > > > >> > > wrote:
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > > Ah, I see. If overriding the default index name allows using
> > > >> > > > > > >> > > > the same name for multiple sensors, then the goal can be
> > > >> > > > > > >> > > > achieved.
> > > >> > > > > > >> > > > Thanks,
> > > >> > > > > > >> > > > --Matt
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > On 1/12/17, 3:30 PM, "Casey Stella" <ceste...@gmail.com> wrote:
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >     Oh, you could! Let's say you have a syslog parser with data
> > > >> > > > > > >> > > >     from sources 1, 2 and 3. You'd end up with one kafka queue
> > > >> > > > > > >> > > >     with 3 parsers attached to that queue, each picking apart
> > > >> > > > > > >> > > >     the messages from source 1, 2 and 3. They'd go through
> > > >> > > > > > >> > > >     separate enrichment and into the indexing topology. In the
> > > >> > > > > > >> > > >     indexing topology, you could specify the same index name
> > > >> > > > > > >> > > >     "syslog" and all of the messages go into the same index for
> > > >> > > > > > >> > > >     CEP querying if so desired.
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <ma...@apache.org>
> > > >> > > > > > >> > > >     wrote:
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > > >> > > > > > >> > > > > previous life. It makes perfect sense to route different
> > > >> > > > > > >> > > > > lines from syslog through different appropriate parsers.
> > > >> > > > > > >> > > > > But a lot of what the parsers do is identify consistent
> > > >> > > > > > >> > > > > subsets of metadata and annotate it – eg, src_ip_addr, event
> > > >> > > > > > >> > > > > timestamps, etc. Once those metadata are annotated and
> > > >> > > > > > >> > > > > available with common field names, why doesn’t it make sense
> > > >> > > > > > >> > > > > to index the messages together, for CEP querying? I think
> > > >> > > > > > >> > > > > Splunk has illustrated this model.
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > On 1/12/17, 3:00 PM, "Casey Stella" <ceste...@gmail.com>
> > > >> > > > > > >> > > > > wrote:
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > >     yeah, I mean, honestly, I think the approach that we've
> > > >> > > > > > >> > > > >     taken for sources which aggregate different types of data
> > > >> > > > > > >> > > > >     is to provide filters at the parser level and have
> > > >> > > > > > >> > > > >     multiple parser topologies (with different, possibly
> > > >> > > > > > >> > > > >     mutually exclusive filters) running. This would be a
> > > >> > > > > > >> > > > >     completely separate sensor.
> > > >> > > > > > >> > > > >     Imagine a syslog data source that aggregates and you want
> > > >> > > > > > >> > > > >     to pick apart certain pieces of messages. This is why the
> > > >> > > > > > >> > > > >     initial thought and architecture was one index per sensor.
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley
> > > >> > > > > > >> > > > >     <ma...@apache.org> wrote:
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > > I’m thinking that CEP (Complex Event Processing) is contrary
> > > >> > > > > > >> > > > > > to the idea of silo-ing data per sensor.
> > > >> > > > > > >> > > > > > Now it’s true that some of those sensors are already
> > > >> > > > > > >> > > > > > aggregating data from multiple sources, so maybe I’m wrong
> > > >> > > > > > >> > > > > > here.
> > > >> > > > > > >> > > > > > But it just seems to me that the “data lake” insights come
> > > >> > > > > > >> > > > > > from being able to make decisions over the whole mass of
> > > >> > > > > > >> > > > > > data rather than just vertical slices of it.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > On 1/12/17, 2:15 PM, "Casey Stella" <ceste...@gmail.com>
> > > >> > > > > > >> > > > > > wrote:
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >     Hey Matt,
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >     Thanks for the comment!
> > > >> > > > > > >> > > > > >     1. At the moment, we only have one index name, the
> > > >> > > > > > >> > > > > >     default of which is the sensor name but that's entirely
> > > >> > > > > > >> > > > > >     up to the user. This is sensor specific, so it'd be a
> > > >> > > > > > >> > > > > >     separate config for each sensor. If we want to build
> > > >> > > > > > >> > > > > >     multiple indices per sensor, we'd have to think
> > > >> > > > > > >> > > > > >     carefully about how to do that and it would be a bigger
> > > >> > > > > > >> > > > > >     undertaking. I guess I can see the use, though (redirect
> > > >> > > > > > >> > > > > >     messages to one index vs another based on a predicate
> > > >> > > > > > >> > > > > >     for a given sensor).
> > > >> > > > > > >> > > > > >     Anyway, not where I was originally thinking that this
> > > >> > > > > > >> > > > > >     discussion would go, but it's an interesting point.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >     2. I hadn't thought through the implementation quite
> > > >> > > > > > >> > > > > >     yet, but we don't actually have a splitter bolt in that
> > > >> > > > > > >> > > > > >     topology, just a spout that goes to the elasticsearch
> > > >> > > > > > >> > > > > >     writer and also to the hdfs writer.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley
> > > >> > > > > > >> > > > > >     <ma...@apache.org> wrote:
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > > Casey, good to have controls like this. Couple questions:
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > 1. Regarding the “index” : “squid” name/value pair, is the
> > > >> > > > > > >> > > > > > > index name expected to always be a sensor name? Or is the
> > > >> > > > > > >> > > > > > > given json structure subordinate to a sensor name in
> > > >> > > > > > >> > > > > > > zookeeper? Or can we build arbitrary indexes with this new
> > > >> > > > > > >> > > > > > > specification, independent of sensor?
> > > >> > > > > > >> > > > > > > Should there actually be a list of “indexes”, ie
> > > >> > > > > > >> > > > > > > { “indexes” : [
> > > >> > > > > > >> > > > > > >     {“index” : “name1”,
> > > >> > > > > > >> > > > > > >      …
> > > >> > > > > > >> > > > > > >     },
> > > >> > > > > > >> > > > > > >     {“index” : “name2”,
> > > >> > > > > > >> > > > > > >      …
> > > >> > > > > > >> > > > > > >     } ]
> > > >> > > > > > >> > > > > > > }
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > 2. Would the filtering / writer selection logic take place
> > > >> > > > > > >> > > > > > > in the indexing topology splitter bolt? Seems like that
> > > >> > > > > > >> > > > > > > would have the smallest impact on current implementation,
> > > >> > > > > > >> > > > > > > no?
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > Sorry if these are already answered in PR-415, I haven’t
> > > >> > > > > > >> > > > > > > had time to review that one yet.
> > > >> > > > > > >> > > > > > > Thanks,
> > > >> > > > > > >> > > > > > > --Matt
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > On 1/12/17, 12:55 PM, "Michael Miklavcic"
> > > >> > > > > > >> > > > > > > <michael.miklav...@gmail.com> wrote:
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >     I like the flexibility and expressibility of the first
> > > >> > > > > > >> > > > > > >     option with Stellar filters.
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >     M
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella
> > > >> > > > > > >> > > > > > >     <ceste...@gmail.com> wrote:
> > > >> > > > > > >> > > > > > >
> > > >> > > > > > >> > > > > > > > As of METRON-652
> > > >> > > > > > >> > > > > > > > <https://github.com/apache/incubator-metron/pull/415>, we
> > > >> > > > > > >> > > > > > > > will have decoupled the indexing configuration from the
> > > >> > > > > > >> > > > > > > > enrichment configuration. As an immediate follow-up to
> > > >> > > > > > >> > > > > > > > that, I'd like to provide the ability to turn off and on
> > > >> > > > > > >> > > > > > > > writers via the configs.
> > > >> > > > > > >> > > > > > > > I'd like to get some community feedback on how the
> > > >> > > > > > >> > > > > > > > functionality should work, if y'all are amenable. :)
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > As of now, we have 3 possible writers which can be used
> > > >> > > > > > >> > > > > > > > in the indexing topology:
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > - Solr
> > > >> > > > > > >> > > > > > > > - Elasticsearch
> > > >> > > > > > >> > > > > > > > - HDFS
> > > >> > > > > > >> > > > > > > >
> > > >> > > > > > >> > > > > > > > HDFS is always used, elasticsearch or solr is used
> > > >> > > > > > >> > > > > > > > depending on how you start the indexing topology.
> >
> > A couple of proposals come to mind immediately:
> >
> > *Index Filtering*
> >
> > You would be able to specify a filter as defined by a stellar
> > statement (likely a reuse of the StellarFilter that exists in the
> > Parsers) which would allow you to indicate on a message-by-message
> > basis whether or not to write the message.
> >
> > The semantics of this would be as follows:
> >
> > - Default (i.e. unspecified) is to pass everything through (hence
> >   backwards compatible with the current default config).
> > - Messages which have the associated stellar statement evaluate to
> >   true for the writer type will be written, otherwise not.
> >
> > Sample indexing config which would write out no messages to HDFS and
> > write out only messages containing a field called "field1":
> >
> > {
> >   "index" : "squid"
> >   ,"batchSize" : 100
> >   ,"filters" : {
> >     "HDFS" : "false"
> >     ,"ES" : "exists(field1)"
> >   }
> > }
> >
> > *Index On/Off Switch*
> >
> > A simpler solution would be to just provide a list of writers to
> > write messages.
> > The semantics would be as follows:
> >
> > - If the list is unspecified, then the default is to write all
> >   messages for every writer in the indexing topology
> > - If the list is specified, then a writer will write all messages if
> >   and only if it is named in the list.
> >
> > Sample indexing config which turns off HDFS and keeps on
> > Elasticsearch:
> >
> > {
> >   "index" : "squid"
> >   ,"batchSize" : 100
> >   ,"writers" : [ "ES" ]
> > }
> >
> > Thanks in advance for the feedback! Also, if you have any other,
> > better ideas than the ones presented here, let me know too.
> >
> > Best,
> >
> > Casey
>
> --
> Nick Allen <n...@nickallen.org>
>
> --
> Jon
> Sent from my mobile device
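
For anyone comparing the two proposals in Casey's original message quoted above, here is a minimal Python sketch of the write/skip decision each one implies. It is only an illustration of the stated semantics, not Metron code: the function names `should_write_filtered` and `should_write_listed` are hypothetical, the Stellar statements "false" and "exists(field1)" are approximated with plain Python predicates, and reading "unspecified" as "no filter entry for that writer" is an assumption.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Predicate = Callable[[Message], bool]


def should_write_filtered(writer: str,
                          message: Message,
                          filters: Optional[Dict[str, Predicate]] = None) -> bool:
    """Index Filtering: a writer with no filter passes everything through;
    otherwise the message is written only if the filter evaluates to true."""
    if not filters or writer not in filters:
        return True  # default: backwards compatible, write everything
    return bool(filters[writer](message))


def should_write_listed(writer: str,
                        writers: Optional[List[str]] = None) -> bool:
    """Index On/Off Switch: with no list, every writer writes; with a list,
    a writer writes if and only if it is named in the list."""
    if writers is None:
        return True
    return writer in writers


# Hypothetical stand-ins for the Stellar statements in the sample configs.
sample_filters: Dict[str, Predicate] = {
    "HDFS": lambda m: False,          # approximates "false"
    "ES": lambda m: "field1" in m,    # approximates "exists(field1)"
}

if __name__ == "__main__":
    msg = {"field1": "value"}
    for w in ("HDFS", "ES"):
        print(w, should_write_filtered(w, msg, sample_filters),
              should_write_listed(w, ["ES"]))
```

The sketch makes the trade-off visible: the filter variant decides per message, while the list variant decides once per writer, which is why the list form is simpler but cannot express something like "only messages containing field1".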