Re: [DISCUSS] Turning off indexing writers feature discussion

Carolyn Duby Fri, 13 Jan 2017 07:00:18 -0800

For larger installations you need to control what is indexed so you don’t end 
up with a nasty Elasticsearch situation and so you can mine the data later for 
reports and training ML models.


Thanks
Carolyn




On 1/13/17, 9:40 AM, "Casey Stella" <ceste...@gmail.com> wrote:

>OH that's a good idea!
>
>On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen <n...@nickallen.org> wrote:
>
>> I like the "Index Filtering" option based on the flexibility that it
>> provides.  Should each output (HDFS, ES, etc) have its own configuration
>> settings?  For example, aren't things like batching handled separately for
>> HDFS versus Elasticsearch?
>>
>> Something along the lines of...
>>
>> {
>>   "hdfs" : {
>>     "when": "exists(field1)",
>>     "batchSize": 100
>>   },
>>
>>   "elasticsearch" : {
>>     "when": "true",
>>     "batchSize": 1000,
>>     "index": "squid"
>>   }
>> }
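
A minimal sketch of how per-writer settings like the ones above might behave, assuming each writer keeps its own batch and its own "when" predicate. The `when_matches` helper is a hypothetical stand-in that understands only the expressions used in this thread ("true", "false", "exists(field)"); it is not Stellar, and `PerWriterBatcher` is an illustrative name, not a Metron class.

```python
from collections import defaultdict


def when_matches(expr, message):
    """Hypothetical stand-in for a Stellar predicate (NOT the real evaluator)."""
    expr = expr.strip()
    if expr == "true":
        return True
    if expr == "false":
        return False
    if expr.startswith("exists(") and expr.endswith(")"):
        return expr[len("exists("):-1].strip() in message
    raise ValueError("unsupported expression: %r" % expr)


class PerWriterBatcher:
    """Buffers messages per writer and flushes each writer at its own batchSize."""

    def __init__(self, config):
        self.config = config
        self.pending = defaultdict(list)
        self.flushed = []  # (writer, batch) pairs standing in for real writes

    def accept(self, message):
        for writer, settings in self.config.items():
            # A missing "when" is assumed to mean "write everything".
            if not when_matches(settings.get("when", "true"), message):
                continue
            batch = self.pending[writer]
            batch.append(message)
            if len(batch) >= settings.get("batchSize", 1):
                self.flushed.append((writer, list(batch)))
                batch.clear()


# The config from the message above, as a Python dict.
config = {
    "hdfs": {"when": "exists(field1)", "batchSize": 100},
    "elasticsearch": {"when": "true", "batchSize": 1000, "index": "squid"},
}
batcher = PerWriterBatcher(config)
batcher.accept({"field1": "a"})   # buffered for both writers
batcher.accept({"other": "b"})    # buffered for elasticsearch only
```

Keeping the batch state keyed by writer is what lets HDFS and Elasticsearch flush on different schedules.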
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella <ceste...@gmail.com> wrote:
>>
>> > Yeah, I tend to like the first option too.  Any opposition to that from
>> > anyone?
>> >
>> > The points brought up are good ones and I think that it may be worth a
>> > broader discussion of the requirements of indexing in a separate dev list
>> > thread.  Maybe a list of desires with coherent use-cases justifying them
>> so
>> > we can think about how this stuff should work and where the natural
>> > extension points should be.  After all, we need to toe the line between
>> > engineering and overengineering for features nobody will want.
>> >
>> > I'm not sure about the extensions to the standard fields.  I'm torn
>> between
>> > the notions that we should have no standard fields vs we should have a
>> > boatload of standard fields (with most of them empty).  I exchange
>> > positions fairly regularly on that question. ;)  It may be worth a dev
>> list
>> > discussion to lay out how you imagine an extension of standard fields and
>> > how it might look as implemented in Metron.
>> >
>> > Casey
>> >
>> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
>> > kylerichards...@gmail.com>
>> > wrote:
>> >
>> > > I'll second my preference for the first option. I think the ability to
>> > use
>> > > Stellar filters to customize indexing would be a big win.
>> > >
>> > > I'm glad Matt brought up the point about data lake and CEP. I think
>> this
>> > is
>> > > a really important use case that we need to consider. Take a simple
>> > > example... If I have data coming in from 3 different firewall vendors
>> > and 2
>> > > different web proxy/url filtering vendors and I want to be able to
>> > analyze
>> > > that data set, I need the data to be indexed all together (likely in
>> > HDFS)
>> > > and to have a normalized schema such that IP address, URL, and user
>> name
>> > > (to take a few) can be easily queried and aggregated. I can also
>> envision
>> > > scenarios where I would want to index data based on attributes other
>> than
>> > > sensor, business unit or subsidiary for example.
>> > >
>> > > I've been wanting to propose extending our 7 standard fields to include
>> > > things like URL and user. Is there community interest/support for
>> moving
>> > in
>> > > that direction? If so, I'll start a new thread.
>> > >
>> > > Thanks!
>> > >
>> > > -Kyle
>> > >
>> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley <ma...@apache.org> wrote:
>> > >
>> > > > Ah, I see.  If overriding the default index name allows using the
>> same
>> > > > name for multiple sensors, then the goal can be achieved.
>> > > > Thanks,
>> > > > --Matt
>> > > >
>> > > >
>> > > > On 1/12/17, 3:30 PM, "Casey Stella" <ceste...@gmail.com> wrote:
>> > > >
>> > > >     Oh, you could!  Let's say you have a syslog parser with data from
>> > > > sources 1
>> > > >     2 and 3.  You'd end up with one kafka queue with 3 parsers
>> attached
>> > > to
>> > > > that
>> > > >     queue, each picking apart the messages from sources 1, 2 and 3.
>> > They'd
>> > > > go
>> > > >     through separate enrichment and into the indexing topology.  In
>> the
>> > > >     indexing topology, you could specify the same index name "syslog"
>> > and
>> > > > all
>> > > >     of the messages go into the same index for CEP querying if so
>> > > desired.
>> > > >
>> > > >     On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley <ma...@apache.org>
>> > > wrote:
>> > > >
>> > > >     > Syslog is hell on parsers – I know, I worked at LogLogic in a
>> > > > previous
>> > > >     > life.  It makes perfect sense to route different lines from
>> > syslog
>> > > > through
>> > > >     > different appropriate parsers.  But a lot of what the parsers
>> do
>> > is
>> > > >     > identify consistent subsets of metadata and annotate it – eg,
>> > > > src_ip_addr,
>> > > >     > event timestamps, etc.  Once those metadata are annotated and
>> > > > available
>> > > >     > with common field names, why doesn’t it make sense to index the
>> > > > messages
>> > > >     > together, for CEP querying?  I think Splunk has illustrated
>> this
>> > > > model.
>> > > >     >
>> > > >     > On 1/12/17, 3:00 PM, "Casey Stella" <ceste...@gmail.com>
>> wrote:
>> > > >     >
>> > > >     >     yeah, I mean, honestly, I think the approach that we've
>> taken
>> > > for
>> > > >     > sources
>> > > >     >     which aggregate different types of data is to provide
>> filters
>> > > at
>> > > > the
>> > > >     > parser
>> > > >     >     level and have multiple parser topologies (with different,
>> > > > possibly
>> > > >     >     mutually exclusive filters) running.  This would be a
>> > > completely
>> > > >     > separate
>> > > >     >     sensor.  Imagine a syslog data source that aggregates and
>> you
>> > > > want to
>> > > >     > pick
>> > > >     >     apart certain pieces of messages.  This is why the initial
>> > > > thought and
>> > > >     >     architecture was one index per sensor.
>> > > >     >
>> > > >     >     On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley <
>> > ma...@apache.org>
>> > > > wrote:
>> > > >     >
>> > > >     >     > I’m thinking that CEP (Complex Event Processing) is
>> > contrary
>> > > > to the
>> > > >     > idea
>> > > >     >     > of silo-ing data per sensor.
>> > > >     >     > Now it’s true that some of those sensors are already
>> > > > aggregating
>> > > >     > data from
>> > > >     >     > multiple sources, so maybe I’m wrong here.
>> > > >     >     > But it just seems to me that the “data lake” insights
>> come
>> > > from
>> > > >     > being able
>> > > >     >     > to make decisions over the whole mass of data rather than
>> > > just
>> > > >     > vertical
>> > > >     >     > slices of it.
>> > > >     >     >
>> > > >     >     > On 1/12/17, 2:15 PM, "Casey Stella" <ceste...@gmail.com>
>> > > > wrote:
>> > > >     >     >
>> > > >     >     >     Hey Matt,
>> > > >     >     >
>> > > >     >     >     Thanks for the comment!
>> > > >     >     >     1. At the moment, we only have one index name, the
>> > > default
>> > > > of
>> > > >     > which is
>> > > >     >     > the
>> > > >     >     >     sensor name but that's entirely up to the user.  This
>> > is
>> > > > sensor
>> > > >     >     > specific,
>> > > >     >     >     so it'd be a separate config for each sensor.  If we
>> > want
>> > > > to
>> > > >     > build
>> > > >     >     > multiple
>> > > >     >     >     indices per sensor, we'd have to think carefully
>> about
>> > > how
>> > > > to do
>> > > >     > that
>> > > >     >     > and
>> > > >     >     >     it would be a bigger undertaking.  I guess I can see the
>> > > use,
>> > > > though
>> > > >     >     > (redirect
>> > > >     >     >     messages to one index vs another based on a predicate
>> > for
>> > > > a given
>> > > >     >     > sensor).
>> > > >     >     >     Anyway, not where I was originally thinking that this
>> > > > discussion
>> > > >     > would
>> > > >     >     > go,
>> > > >     >     >     but it's an interesting point.
>> > > >     >     >
>> > > >     >     >     2. I hadn't thought through the implementation quite
>> > yet,
>> > > > but we
>> > > >     > don't
>> > > >     >     >     actually have a splitter bolt in that topology, just
>> a
>> > > > spout
>> > > >     > that goes
>> > > >     >     > to
>> > > >     >     >     the elasticsearch writer and also to the hdfs writer.
>> > > >     >     >
>> > > >     >     >     On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley <
>> > > > ma...@apache.org>
>> > > >     > wrote:
>> > > >     >     >
>> > > >     >     >     > Casey, good to have controls like this.  Couple
>> > > > questions:
>> > > >     >     >     >
>> > > >     >     >     > 1. Regarding the “index” : “squid” name/value pair,
>> > is
>> > > > the
>> > > >     > index name
>> > > >     >     >     > expected to always be a sensor name?  Or is the
>> given
>> > > > json
>> > > >     > structure
>> > > >     >     >     > subordinate to a sensor name in zookeeper?  Or can
>> we
>> > > > build
>> > > >     > arbitrary
>> > > >     >     >     > indexes with this new specification, independent of
>> > > > sensor?
>> > > >     > Should
>> > > >     >     > there
>> > > >     >     >     > actually be a list of “indexes”, ie
>> > > >     >     >     > { "indexes" : [
>> > > >     >     >     >         {"index" : "name1",
>> > > >     >     >     >                 …
>> > > >     >     >     >         },
>> > > >     >     >     >         {"index" : "name2",
>> > > >     >     >     >                 …
>> > > >     >     >     >         } ]
>> > > >     >     >     > }
>> > > >     >     >     >
>> > > >     >     >     > 2. Would the filtering / writer selection logic
>> take
>> > > > place in
>> > > >     > the
>> > > >     >     > indexing
>> > > >     >     >     > topology splitter bolt?  Seems like that would have
>> > the
>> > > >     > smallest
>> > > >     >     > impact on
>> > > >     >     >     > current implementation, no?
>> > > >     >     >     >
>> > > >     >     >     > Sorry if these are already answered in PR-415, I
>> > > haven’t
>> > > > had
>> > > >     > time to
>> > > >     >     >     > review that one yet.
>> > > >     >     >     > Thanks,
>> > > >     >     >     > --Matt
>> > > >     >     >     >
>> > > >     >     >     >
>> > > >     >     >     > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
>> > > >     >     > michael.miklav...@gmail.com>
>> > > >     >     >     > wrote:
>> > > >     >     >     >
>> > > >     >     >     >     I like the flexibility and expressibility of
>> the
>> > > > first
>> > > >     > option
>> > > >     >     > with
>> > > >     >     >     > Stellar
>> > > >     >     >     >     filters.
>> > > >     >     >     >
>> > > >     >     >     >     M
>> > > >     >     >     >
>> > > >     >     >     >     On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
>> > > >     >     > ceste...@gmail.com>
>> > > >     >     >     > wrote:
>> > > >     >     >     >
>> > > >     >     >     >     > As of METRON-652 <https://github.com/apache/incubator-metron/pull/415>, we
>> > > >     >     >     >     > will have decoupled the indexing
>> configuration
>> > > > from the
>> > > >     >     > enrichment
>> > > >     >     >     >     > configuration.  As an immediate follow-up to
>> > > that,
>> > > > I'd
>> > > >     > like to
>> > > >     >     >     > provide the
>> > > >     >     >     >     > ability to turn off and on writers via the
>> > > > configs.  I'd
>> > > >     > like
>> > > >     >     > to get
>> > > >     >     >     > some
>> > > >     >     >     >     > community feedback on how the functionality
>> > > should
>> > > > work,
>> > > >     > if
>> > > >     >     > y'all are
>> > > >     >     >     >     > amenable. :)
>> > > >     >     >     >     >
>> > > >     >     >     >     >
>> > > >     >     >     >     > As of now, we have 3 possible writers which
>> can
>> > > be
>> > > > used
>> > > >     > in the
>> > > >     >     >     > indexing
>> > > >     >     >     >     > topology:
>> > > >     >     >     >     >
>> > > >     >     >     >     >    - Solr
>> > > >     >     >     >     >    - Elasticsearch
>> > > >     >     >     >     >    - HDFS
>> > > >     >     >     >     >
>> > > >     >     >     >     > HDFS is always used; Elasticsearch or Solr is used depending on how
>> > > >     >     >     >     > you start the indexing topology.
>> > > >     >     >     >     >
>> > > >     >     >     >     > A couple of proposals come to mind
>> immediately:
>> > > >     >     >     >     >
>> > > >     >     >     >     > *Index Filtering*
>> > > >     >     >     >     >
>> > > >     >     >     >     > You would be able to specify a filter as
>> > defined
>> > > > by a
>> > > >     > stellar
>> > > >     >     >     > statement
>> > > >     >     >     >     > (likely a reuse of the StellarFilter that
>> > exists
>> > > > in the
>> > > >     >     > Parsers)
>> > > >     >     >     > which
>> > > >     >     >     >     > would allow you to indicate on a
>> > > > message-by-message basis
>> > > >     >     > whether or
>> > > >     >     >     > not to
>> > > >     >     >     >     > write the message.
>> > > >     >     >     >     >
>> > > >     >     >     >     > The semantics of this would be as follows:
>> > > >     >     >     >     >
>> > > >     >     >     >     >    - Default (i.e. unspecified) is to pass
>> > > > everything
>> > > >     > through
>> > > >     >     > (hence
>> > > >     >     >     >     >    backwards compatible with the current
>> > default
>> > > > config).
>> > > >     >     >     >     >    - Messages which have the associated
>> stellar
>> > > > statement
>> > > >     >     > evaluate
>> > > >     >     >     > to true
>> > > >     >     >     >     >    for the writer type will be written,
>> > otherwise
>> > > > not.
>> > > >     >     >     >     >
>> > > >     >     >     >     >
>> > > >     >     >     >     > Sample indexing config which would write out no messages to HDFS and
>> > > >     >     >     >     > write out to Elasticsearch only messages containing a field called
>> > > >     >     >     >     > "field1":
>> > > >     >     >     >     > {
>> > > >     >     >     >     >    "index" : "squid"
>> > > >     >     >     >     >   ,"batchSize" : 100
>> > > >     >     >     >     >   ,"filters" : {
>> > > >     >     >     >     >       "HDFS" : "false"
>> > > >     >     >     >     >      ,"ES" : "exists(field1)"
>> > > >     >     >     >     >                  }
>> > > >     >     >     >     > }
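
A minimal sketch of the Index Filtering semantics described above: a writer with no filter passes everything through, and otherwise a message is written only when that writer's statement evaluates to true. The `evaluate` function is a hypothetical stand-in that handles only "true", "false" and "exists(field)"; it is not the real Stellar evaluator, and `writers_for` is an illustrative name.

```python
import re

# Hypothetical stand-in for Stellar: only the expressions seen in this thread.
_EXISTS = re.compile(r"^exists\((\w+)\)$")


def evaluate(expr, message):
    expr = expr.strip()
    if expr == "true":
        return True
    if expr == "false":
        return False
    match = _EXISTS.match(expr)
    if match:
        return match.group(1) in message
    raise ValueError("unsupported expression: %r" % expr)


def writers_for(message, config, available=("HDFS", "ES")):
    """Return the writers that should receive this message."""
    filters = config.get("filters", {})
    # An unspecified filter defaults to "true", i.e. pass everything through.
    return [w for w in available if evaluate(filters.get(w, "true"), message)]


# The sample config from the proposal, as a Python dict.
config = {
    "index": "squid",
    "batchSize": 100,
    "filters": {"HDFS": "false", "ES": "exists(field1)"},
}
print(writers_for({"field1": "x"}, config))
print(writers_for({"other": "y"}, config))
```

Defaulting a missing filter to "true" is what keeps a config without a "filters" block behaving as it does today.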
>> > > >     >     >     >     >
>> > > >     >     >     >     > *Index On/Off Switch*
>> > > >     >     >     >     >
>> > > >     >     >     >     > A simpler solution would be to just provide a
>> > > list
>> > > > of
>> > > >     > writers
>> > > >     >     > to
>> > > >     >     >     > write
>> > > >     >     >     >     > messages.  The semantics would be as follows:
>> > > >     >     >     >     >
>> > > >     >     >     >     >    - If the list is unspecified, then the
>> > default
>> > > > is to
>> > > >     > write
>> > > >     >     > all
>> > > >     >     >     > messages
>> > > >     >     >     >     >    for every writer in the indexing topology
>> > > >     >     >     >     >    - If the list is specified, then a writer
>> > will
>> > > > write
>> > > >     > all
>> > > >     >     > messages
>> > > >     >     >     > if and
>> > > >     >     >     >     >    only if it is named in the list.
>> > > >     >     >     >     >
>> > > >     >     >     >     > Sample indexing config which turns off HDFS
>> and
>> > > > keeps on
>> > > >     >     >     > Elasticsearch:
>> > > >     >     >     >     > {
>> > > >     >     >     >     >    "index" : "squid"
>> > > >     >     >     >     >   ,"batchSize" : 100
>> > > >     >     >     >     >   ,"writers" : [ "ES" ]
>> > > >     >     >     >     > }
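
A minimal sketch of the On/Off Switch semantics described above, assuming the topology knows which writers it has and the config optionally names a subset. The `enabled_writers` helper and the `topology_writers` default are hypothetical, for illustration only.

```python
def enabled_writers(config, topology_writers=("HDFS", "ES")):
    """Return the writers that should write all messages under this config."""
    names = config.get("writers")
    if names is None:
        # List unspecified: every writer in the topology writes everything.
        return list(topology_writers)
    # List specified: a writer writes if and only if it is named.
    return [w for w in topology_writers if w in names]


# The sample config from the proposal: HDFS off, Elasticsearch on.
config = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
print(enabled_writers(config))
```

Note that under this reading an explicitly empty list turns every writer off, which is different from leaving the list out.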
>> > > >     >     >     >     >
>> > > >     >     >     >     > Thanks in advance for the feedback!  Also, if
>> > you
>> > > > have
>> > > >     > any
>> > > >     >     > other,
>> > > >     >     >     > better
>> > > >     >     >     >     > ideas than the ones presented here, let me
>> know
>> > > > too.
>> > > >     >     >     >     >
>> > > >     >     >     >     > Best,
>> > > >     >     >     >     >
>> > > >     >     >     >     > Casey
>> > > >     >     >     >     >
>> > > >     >     >     >
>> > > >     >     >     >
>> > > >     >     >     >
>> > > >     >     >     >
>> > > >     >     >     >
>> > > >     >     >
>> > > >     >     >
>> > > >     >     >
>> > > >     >     >
>> > > >     >     >
>> > > >     >
>> > > >     >
>> > > >     >
>> > > >     >
>> > > >     >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Nick Allen <n...@nickallen.org>
>>
