For larger installations, you need to control what is indexed so you don’t end up with a nasty Elasticsearch situation and so you can mine the data later for reports and training ML models.
Thanks,
Carolyn

On 1/13/17, 9:40 AM, "Casey Stella" <ceste...@gmail.com> wrote:

>OH that's a good idea!
>
>On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <n...@nickallen.org> wrote:
>
>> I like the "Index Filtering" option based on the flexibility that it provides. Should each output (HDFS, ES, etc) have its own configuration settings? For example, aren't things like batching handled separately for HDFS versus Elasticsearch?
>>
>> Something along the lines of...
>>
>> {
>>   "hdfs" : {
>>     "when": "exists(field1)",
>>     "batchSize": 100
>>   },
>>
>>   "elasticsearch" : {
>>     "when": "true",
>>     "batchSize": 1000,
>>     "index": "squid"
>>   }
>> }
>>
>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <ceste...@gmail.com> wrote:
>>
>> > Yeah, I tend to like the first option too. Any opposition to that from anyone?
>> >
>> > The points brought up are good ones and I think that it may be worth a broader discussion of the requirements of indexing in a separate dev list thread. Maybe a list of desires with coherent use-cases justifying them so we can think about how this stuff should work and where the natural extension points should be. After all, we need to toe the line between engineering and overengineering for features nobody will want.
>> >
>> > I'm not sure about the extensions to the standard fields. I'm torn between the notions that we should have no standard fields vs we should have a boatload of standard fields (with most of them empty). I exchange positions fairly regularly on that question. ;) It may be worth a dev list discussion to lay out how you imagine an extension of standard fields and how it might look as implemented in Metron.
>> >
>> > Casey
>> >
>> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <kylerichards...@gmail.com> wrote:
>> >
>> > > I'll second my preference for the first option. I think the ability to use Stellar filters to customize indexing would be a big win.
>> > >
>> > > I'm glad Matt brought up the point about data lake and CEP. I think this is a really important use case that we need to consider. Take a simple example... If I have data coming in from 3 different firewall vendors and 2 different web proxy/url filtering vendors and I want to be able to analyze that data set, I need the data to be indexed all together (likely in HDFS) and to have a normalized schema such that IP address, URL, and user name (to take a few) can be easily queried and aggregated. I can also envision scenarios where I would want to index data based on attributes other than sensor, business unit or subsidiary for example.
>> > >
>> > > I've been wanting to propose extending our 7 standard fields to include things like URL and user. Is there community interest/support for moving in that direction? If so, I'll start a new thread.
>> > >
>> > > Thanks!
>> > >
>> > > -Kyle
>> > >
>> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <ma...@apache.org> wrote:
>> > >
>> > > > Ah, I see. If overriding the default index name allows using the same name for multiple sensors, then the goal can be achieved.
>> > > > Thanks,
>> > > > --Matt
>> > > >
>> > > > On 1/12/17, 3:30 PM, "Casey Stella" <ceste...@gmail.com> wrote:
>> > > >
>> > > >     Oh, you could! Let's say you have a syslog parser with data from sources 1, 2 and 3. You'd end up with one kafka queue with 3 parsers attached to that queue, each picking apart the messages from source 1, 2 and 3. They'd go through separate enrichment and into the indexing topology. In the indexing topology, you could specify the same index name "syslog" and all of the messages go into the same index for CEP querying if so desired.
>> > > >
>> > > >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <ma...@apache.org> wrote:
>> > > >
>> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a previous life. It makes perfect sense to route different lines from syslog through different appropriate parsers. But a lot of what the parsers do is identify consistent subsets of metadata and annotate it – eg, src_ip_addr, event timestamps, etc. Once those metadata are annotated and available with common field names, why doesn’t it make sense to index the messages together, for CEP querying? I think Splunk has illustrated this model.
>> > > > >
>> > > > > On 1/12/17, 3:00 PM, "Casey Stella" <ceste...@gmail.com> wrote:
>> > > > >
>> > > > >     yeah, I mean, honestly, I think the approach that we've taken for sources which aggregate different types of data is to provide filters at the parser level and have multiple parser topologies (with different, possibly mutually exclusive filters) running. This would be a completely separate sensor. Imagine a syslog data source that aggregates and you want to pick apart certain pieces of messages. This is why the initial thought and architecture was one index per sensor.
>> > > > >
>> > > > >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <ma...@apache.org> wrote:
>> > > > >
>> > > > > > I’m thinking that CEP (Complex Event Processing) is contrary to the idea of silo-ing data per sensor.
>> > > > > > Now it’s true that some of those sensors are already aggregating data from multiple sources, so maybe I’m wrong here.
>> > > > > > But it just seems to me that the “data lake” insights come from being able to make decisions over the whole mass of data rather than just vertical slices of it.
>> > > > > >
>> > > > > > On 1/12/17, 2:15 PM, "Casey Stella" <ceste...@gmail.com> wrote:
>> > > > > >
>> > > > > >     Hey Matt,
>> > > > > >
>> > > > > >     Thanks for the comment!
>> > > > > >     1. At the moment, we only have one index name, the default of which is the sensor name but that's entirely up to the user. This is sensor specific, so it'd be a separate config for each sensor. If we want to build multiple indices per sensor, we'd have to think carefully about how to do that and it would be a bigger undertaking. I guess I can see the use, though (redirect messages to one index vs another based on a predicate for a given sensor). Anyway, not where I was originally thinking that this discussion would go, but it's an interesting point.
>> > > > > >
>> > > > > >     2. I hadn't thought through the implementation quite yet, but we don't actually have a splitter bolt in that topology, just a spout that goes to the elasticsearch writer and also to the hdfs writer.
>> > > > > >
>> > > > > >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <ma...@apache.org> wrote:
>> > > > > >
>> > > > > > > Casey, good to have controls like this. Couple questions:
>> > > > > > >
>> > > > > > > 1. Regarding the “index” : “squid” name/value pair, is the index name expected to always be a sensor name? Or is the given json structure subordinate to a sensor name in zookeeper? Or can we build arbitrary indexes with this new specification, independent of sensor? Should there actually be a list of “indexes”, ie
>> > > > > > > { “indexes” : [
>> > > > > > >     {“index” : “name1”,
>> > > > > > >      …
>> > > > > > >     },
>> > > > > > >     {“index” : “name2”,
>> > > > > > >      …
>> > > > > > >     } ]
>> > > > > > > }
>> > > > > > >
>> > > > > > > 2. Would the filtering / writer selection logic take place in the indexing topology splitter bolt? Seems like that would have the smallest impact on current implementation, no?
>> > > > > > >
>> > > > > > > Sorry if these are already answered in PR-415, I haven’t had time to review that one yet.
>> > > > > > > Thanks,
>> > > > > > > --Matt
>> > > > > > >
>> > > > > > > On 1/12/17, 12:55 PM, "Michael Miklavcic" <michael.miklav...@gmail.com> wrote:
>> > > > > > >
>> > > > > > >     I like the flexibility and expressibility of the first option with Stellar filters.
>> > > > > > >
>> > > > > > >     M
>> > > > > > >
>> > > > > > >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <ceste...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > As of METRON-652 <https://github.com/apache/incubator-metron/pull/415>, we will have decoupled the indexing configuration from the enrichment configuration. As an immediate follow-up to that, I'd like to provide the ability to turn off and on writers via the configs. I'd like to get some community feedback on how the functionality should work, if y'all are amenable. :)
>> > > > > > > >
>> > > > > > > > As of now, we have 3 possible writers which can be used in the indexing topology:
>> > > > > > > >
>> > > > > > > >    - Solr
>> > > > > > > >    - Elasticsearch
>> > > > > > > >    - HDFS
>> > > > > > > >
>> > > > > > > > HDFS is always used; Elasticsearch or Solr is used depending on how you start the indexing topology.
>> > > > > > > >
>> > > > > > > > A couple of proposals come to mind immediately:
>> > > > > > > >
>> > > > > > > > *Index Filtering*
>> > > > > > > >
>> > > > > > > > You would be able to specify a filter as defined by a stellar statement (likely a reuse of the StellarFilter that exists in the Parsers) which would allow you to indicate on a message-by-message basis whether or not to write the message.
>> > > > > > > >
>> > > > > > > > The semantics of this would be as follows:
>> > > > > > > >
>> > > > > > > >    - Default (i.e. unspecified) is to pass everything through (hence backwards compatible with the current default config).
>> > > > > > > >    - Messages which have the associated stellar statement evaluate to true for the writer type will be written, otherwise not.
>> > > > > > > >
>> > > > > > > > Sample indexing config which would write out no messages to HDFS and write out only messages containing a field called "field1" to Elasticsearch:
>> > > > > > > > {
>> > > > > > > >    "index" : "squid"
>> > > > > > > >   ,"batchSize" : 100
>> > > > > > > >   ,"filters" : {
>> > > > > > > >       "HDFS" : "false"
>> > > > > > > >      ,"ES" : "exists(field1)"
>> > > > > > > >                }
>> > > > > > > > }
>> > > > > > > >
>> > > > > > > > *Index On/Off Switch*
>> > > > > > > >
>> > > > > > > > A simpler solution would be to just provide a list of writers to write messages. The semantics would be as follows:
>> > > > > > > >
>> > > > > > > >    - If the list is unspecified, then the default is to write all messages for every writer in the indexing topology
>> > > > > > > >    - If the list is specified, then a writer will write all messages if and only if it is named in the list.
>> > > > > > > >
>> > > > > > > > Sample indexing config which turns off HDFS and keeps on Elasticsearch:
>> > > > > > > > {
>> > > > > > > >    "index" : "squid"
>> > > > > > > >   ,"batchSize" : 100
>> > > > > > > >   ,"writers" : [ "ES" ]
>> > > > > > > > }
>> > > > > > > >
>> > > > > > > > Thanks in advance for the feedback! Also, if you have any other, better ideas than the ones presented here, let me know too.
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > >
>> > > > > > > > Casey
>>
>> --
>> Nick Allen <n...@nickallen.org>
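
For anyone skimming the quoted thread, the "Index Filtering" semantics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the config mirrors the per-writer JSON Nick suggested, while `evaluate` and `writers_for` are hypothetical helpers, and `evaluate` is a toy stand-in that understands only `true`, `false`, and `exists(field)`. It is not Stellar and not Metron's actual implementation.

```python
import re

# Hypothetical per-writer config, modeled on the JSON suggested in the thread.
CONFIG = {
    "hdfs": {"when": "exists(field1)", "batchSize": 100},
    "elasticsearch": {"when": "true", "batchSize": 1000, "index": "squid"},
}

_EXISTS = re.compile(r"^exists\((\w+)\)$")


def evaluate(predicate, message):
    """Toy stand-in for a Stellar filter: only true, false, exists(field)."""
    predicate = predicate.strip()
    if predicate == "true":
        return True
    if predicate == "false":
        return False
    match = _EXISTS.match(predicate)
    if match:
        return match.group(1) in message
    raise ValueError(f"unsupported predicate: {predicate!r}")


def writers_for(message, config):
    """Return the sorted names of the writers that should receive the message.

    A writer with no "when" entry passes everything through, which matches
    the backwards-compatible default described in the proposal.
    """
    return sorted(
        name
        for name, settings in config.items()
        if evaluate(settings.get("when", "true"), message)
    )
```

With this config, a message carrying `field1` would go to both writers, while a message without it would go to Elasticsearch only. Keeping the predicate per writer is what lets batching and index name vary independently for each output.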