Re: [DISCUSS] Management of Elastic and other index schemas

zeo...@gmail.com Wed, 22 Feb 2017 09:25:18 -0800

There's another benefit of this abstraction which could comprehensively get
rid of issues we've had in the past with Lucene specific limitations.  For
instance, we could force that any string has the right attributes to handle
long untokenized strings across the board, taking that out of the
individual developer and reviewer's hands.
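
As a rough sketch of that idea (not an existing Metron API; the type names,
ES_TYPE_MAP and to_es_properties are all made up for illustration), a
translator from a generic field definition to Elasticsearch mapping
properties could apply the string settings in one place. It assumes an
Elasticsearch version with the keyword type, and uses ignore_above so that
over-long values are skipped rather than hitting Lucene's 32766-byte term
limit:

```python
# Hypothetical mapping from generic field types to Elasticsearch mapping
# fragments. Every string gets the same untokenized settings; 8191 is
# 32766 // 4, a conservative character limit for UTF-8 text.
ES_TYPE_MAP = {
    "string": {"type": "keyword", "ignore_above": 8191},
    "ip": {"type": "ip"},
    "timestamp": {"type": "date", "format": "epoch_millis"},
    "integer": {"type": "integer"},
    "float": {"type": "float"},
}


def to_es_properties(generic_schema):
    """Translate {field_name: generic_type} into an ES 'properties' dict.

    Fields with an unrecognised type are treated as strings, so they
    pick up the same safe settings.
    """
    props = {}
    for name, generic_type in generic_schema.items():
        fragment = ES_TYPE_MAP.get(generic_type, ES_TYPE_MAP["string"])
        props[name] = dict(fragment)  # copy so callers cannot mutate the map
    return props
```

With something like this, a parser author only declares "string" and never
has to remember the attributes.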


Jon

On Mon, Feb 20, 2017, 12:55 PM Nick Allen <n...@nickallen.org> wrote:

> I'm just blowing smoke at 10,000 feet here. :) I think we could engineer it
> to be performant in some manner.
>
> Per your thoughts, it would make sense to have this sort of thing hooked to
> configuration changes, not a check for every message that comes in.
>
> On Mon, Feb 20, 2017 at 2:51 PM, Otto Fowler <ottobackwa...@gmail.com>
> wrote:
>
> > I think that would be interesting to do,
> >
> > validateFields()
> > updateIndexIf()
> > insert()
> >
> > But do you want to take that hit every message?  I’m not sure.
> >
> >
> > What if we instead hooked to configuration such that when you ‘commit’ a
> > configuration
> > change it recalculates and fixes up the index instead?  So don’t do it in
> > the indexer, but have
> > good lifecycle management in the configuration.
> >
> > There are issues there with timing the switchover I’d want to think
> > through, but I like that better
> > than putting that in the stream.
> >
> >
> > On February 20, 2017 at 14:39:57, Nick Allen (n...@nickallen.org) wrote:
> >
> > Since enrichments, and even parsers, can be added on-the-fly, should the
> > ES
> > indexer be intelligent enough to manage the index templates on-the-fly
> > also? Ideally, I should never have to manually install something like an
> > ES template. The indexer should just take care of all that.
> >
> > In the case of the Elasticsearch indexer, if it notices a new field added
> > by an enrichment, or a new source of telemetry, then it should update the
> > ES template on-the-fly also. Ideally, we would never have to manually
> > create/deploy an ES template. It should all happen seamlessly and remain
> > in-sync with whatever enrichments, etc exist.
> >
> >
> >
> >
> >
> > On Mon, Feb 20, 2017 at 2:36 PM, Nick Allen <n...@nickallen.org> wrote:
> >
> > >
> > > Taking this a step further, I think this challenge goes beyond just
> > > parsers. We would also need to solve this problem for enrichments. When
> > I
> > > add an enrichment, I want the enriched data to be indexed accurately.
> > How
> > > can we make that happen?
> > >
> > > - As part of defining an enrichment, should I also be able to specify
> > > the fields and types using this same generic definition?
> > > - Or could this be inferred somehow via an extension to Stellar?
> > >
> > >
> > >
> > > On Mon, Feb 20, 2017 at 2:29 PM, Nick Allen <n...@nickallen.org>
> wrote:
> > >
> > >> I like the flexibility and extensibility of having some kind of
> > internal
> > >> representation (generic definition) of the names and types of the
> > fields
> > >> produced by a parser.
> > >>
> > >> Rather than shipping with an ES template, a parser would ship with a
> > >> generic definition of the field names and data types that it adds to a
> > >> message. The Elasticsearch indexer would then take this generic
> > definition
> > >> and translate it to an Elasticsearch template.
> > >>
> > >> Each new indexer (Solr, etc) would know how to consume this generic
> > >> definition and produce whatever artifacts that it needs to index the
> > data
> > >> accurately.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Sat, Feb 18, 2017 at 12:26 PM, James Sirota <jsir...@apache.org>
> > >> wrote:
> > >>
> > >>> I am not sure I agree with packaging source-specific templates with
> > the
> > >>> parser. I think that would make it harder to add additional storage
> > >>> sources. For example, what happens if I have 50 parsers with Solr and
> > ES
> > >>> schemas defined, but now I want to add druid? Now I have to add 50
> > schemas
> > >>> to all my existing parsers, which I don't think makes sense. I think
> > what
> > >>> we should have instead is tuple mappers that map some internal
> > >>> representation of our schema to whatever schema the tool uses. We
> > already
> > >>> somewhat started to move down this path with Kyle defining the schema
> > enum
> > >>> for his ASA parser PR and Simon defining a JSON schema for his CEF
> > parser
> > >>> PR. I think we need to unify these approaches and then propagate them
> > to
> > >>> all the parsers. I think what has to happen is the following:
> > >>>
> > >>> We have to introduce a partial schema for Metron messages where you
> > can
> > >>> enforce a schema on a part of a message you want, but at the same
> time
> > >>> allow enough flexibility for the rest of the message to be flexible.
> > What
> > >>> I mean by that is that you should enforce a schema for things like
> ip,
> > >>> protocol, timestamp, etc, but have a fully flexible structure outside
> > of
> > >>> that.
> > >>>
> > >>> After you do that then you can map the partial schema you defined to
> > es,
> > >>> solr, druid, etc, etc. For the fields you don't have a schema for you
> > just
> > >>> assume they are strings. To add additional storage/indexing source to
> > >>> Metron all you do is define a mapper to that source's schema and load
> > that
> > >>> into our indexing bolt.
> > >>>
> > >>>
> > >>>
> > >>> Thanks,
> > >>> James
> > >>>
> > >>> 17.02.2017, 16:36, "zeo...@gmail.com" <zeo...@gmail.com>:
> > >>> > I think this is a good direction to move things toward - moving
> > >>> indexing
> > >>> > templates to be packaged with parsers (using multiple tiered
> > options)
> > >>> that
> > >>> > are then merged with the possible enrich fields before getting
> added
> > >>> to the
> > >>> > indexing technology in use. Now, to read the proposal thread...
> > >>> >
> > >>> > Jon
> > >>> >
> > >>> > On Fri, Feb 17, 2017, 4:25 PM Simon Elliston Ball <
> > >>> > si...@simonellistonball.com> wrote:
> > >>> >
> > >>> >> I’d broadly agree with that tiered approach.
> > >>> >>
> > >>> >> The version where the parser emits a generic schema, and
> > enrichments
> > >>> >> contribute generic schema chunks to that which get combined into
> an
> > >>> indexer
> > >>> >> specific template generated at the end of the flow, so yes, pretty
> > >>> much
> > >>> >> in line with your proposal. (I did read through it, apologies if I
> > >>> missed any
> > >>> >> of the detail, brain is still a little bit post-RSA!)
> > >>> >>
> > >>> >> Simon
> > >>> >>
> > >>> >> > On 17 Feb 2017, at 12:38, Otto Fowler <ottobackwa...@gmail.com>
> > >>> wrote:
> > >>> >> >
> > >>> >> > We already make them do this now, or they get the defaults. So
> > >>> this is
> > >>> >> no different.
> > >>> >> > Having parsers emit names and types etc, that would be another
> > >>> step - or
> > >>> >> it could be the ‘generic schema’ as implemented actually.
> > >>> >> >
> > >>> >> > A tiered approach - from
> > >>> >> > * you give nothing with the parser - you get whatever ES guesses
> > >>> at but
> > >>> >> you don’t care do you
> > >>> >> > * you give the schema
> > >>> >> > * you give the types and we figure it out for you
> > >>> >> >
> > >>> >> > would be the best to move to.
> > >>> >> >
> > >>> >> > Also, we could use the names and types method tied to enrichment
> > to
> > >>> >> generate indexing templates for enrichment types or deriving them
> > >>> rather,
> > >>> >> which I mention in my proposal.
> > >>> >> >
> > >>> >> > I’m starting to think you haven’t rushed out to read it Simon ;)
> > >>> >> >
> > >>> >> >
> > >>> >> >
> > >>> >> > On February 17, 2017 at 15:24:37, Simon Elliston Ball (
> > >>> >> > si...@simonellistonball.com) wrote:
> > >>> >> >
> > >>> >> >> I like that, to an extent… Forcing the provision of explicit
> > >>> schema
> > >>> >> might be a bit of a load for parser development. I’m assuming that
> > >>> custom
> > >>> >> parsers would be pushed towards the same packaging approach.
> > >>> >> >>
> > >>> >> >> Would it make sense to require the parser to emit field names
> > and
> > >>> types
> > >>> >> expected, and then for us to provide a means of creating the
> > >>> templates for
> > >>> >> supported indices, and push the actual template management to the
> > >>> index
> > >>> >> layer rather than the parsing layer. Schema is after all
> determined
> > >>> not
> > >>> >> just by a parser, but also by the combination of enrichments and
> > >>> models
> > >>> >> applied.
> > >>> >> >>
> > >>> >> >> We could also of course provide an override option within your
> > >>> proposed
> > >>> >> parser package model to allow any destination specific
> > configuration
> > >>> of the
> > >>> >> indexing template.
> > >>> >> >>
> > >>> >> >> Simon
> > >>> >> >>
> > >>> >> >> > On 17 Feb 2017, at 12:01, Otto Fowler <ottobackwa...@gmail.com> wrote:
> > >>> >> >> >
> > >>> >> >> > I think we can get there from my proposal.
> > >>> >> >> > A source may package:
> > >>> >> >> > * explicit schemas ( ES, SOLR, FOO )
> > >>> >> >> > * a generic to be invented schema for a to be invented
> > pluggable
> > >>> >> indexing
> > >>> >> >> > component :)
> > >>> >> >> > and we’ll be able to handle it.
> > >>> >> >> >
> > >>> >> >> >
> > >>> >> >> >
> > >>> >> >> > On February 17, 2017 at 14:39:07, Kyle Richardson (kylerichards...@gmail.com) wrote:
> > >>> >> >> >
> > >>> >> >> > I personally like the idea of a typed schema per parser that
> > we
> > >>> could
> > >>> >> >> > translate to multiple targets. This would allow us a lot more
> > >>> >> modularity
> > >>> >> >> > and extensibility in indexing down the road.
> > >>> >> >> >
> > >>> >> >> > -Kyle
> > >>> >> >> >
> > >>> >> >> > On Fri, Feb 17, 2017 at 1:59 PM, Simon Elliston Ball <
> > >>> >> >> > si...@simonellistonball.com> wrote:
> > >>> >> >> >
> > >>> >> >> >> That sounds like a great idea Otto. Do you have any early
> > >>> design on
> > >>> >> that
> > >>> >> >> >> we can look at. Also, rather than just elastic templates do
> > you
> > >>> >> think we
> > >>> >> >> >> should have some sort of typed schema we could translate to
> > >>> multiple
> > >>> >> >> >> targets (solr, elastic, ur... other...) or are you thinking
> > of
> > >>> >> packaging
> > >>> >> >> >> specific schema assets like template json with the parser?
> > >>> >> >> >>
> > >>> >> >> >> Simon
> > >>> >> >> >>
> > >>> >> >> >>> On 17 Feb 2017, at 18:42, Otto Fowler <ottobackwa...@gmail.com> wrote:
> > >>> >> >> >>>
> > >>> >> >> >>>
> > >>> >> >> >>> Not to jump the gun, but I’m crafting a proposal about
> > >>> parsers and
> > >>> >> one
> > >>> >> >> >> of the things I am going to propose relates to having the ES
> > >>> >> Template for
> > >>> >> >> > a
> > >>> >> >> >> given parser installed or packaged with the parser. We could
> > >>> load the
> > >>> >> >> >> template from there, edit, save and deploy etc. We can
> extend
> > >>> that
> > >>> >> >> > concept
> > >>> >> >> >> more and more later (drafts, versioning etc )
> > >>> >> >> >>>
> > >>> >> >> >>>
> > >>> >> >> >>>> On February 17, 2017 at 13:22:45, Simon Elliston Ball (
> > >>> >> >> >>>> si...@simonellistonball.com) wrote:
> > >>> >> >> >>>>
> > >>> >> >> >>>> A little while ago the issue of managing Elastic templates
> > >>> for new
> > >>> >> >> >> sensor configs came up, and we didn’t quite put it to bed.
> > >>> >> >> >>>>
> > >>> >> >> >>>> When creating new sensors, I almost invariably find the
> > >>> >> auto-generated
> > >>> >> >> >> schemas for elastic pick some incorrect types. I also find I
> > >>> have to
> > >>> >> >> >> recreate indexes every time to push in the proper dynamic
> > >>> templates
> > >>> >> for
> > >>> >> >> >> things like geo enrichment fields.
> > >>> >> >> >>>>
> > >>> >> >> >>>> So, my questions are:
> > >>> >> >> >>>> How should we address elastic template for new sensors?
> > >>> >> >> >>>> Do we have circumstances where we would need to configure
> > >>> types, or
> > >>> >> >> > can
> > >>> >> >> >> we get away with inferring them?
> > >>> >> >> >>>> Should we just add some additional dynamic templates to
> > >>> cover our
> > >>> >> >> >> common fields like timestamp (the most common culprit I find
> > >>> for
> > >>> >> >> > incorrect
> > >>> >> >> >> typing)?
> > >>> >> >> >>>>
> > >>> >> >> >>>> I’d also like to think about ways we can generalise this.
> > >>> Does
> > >>> >> anyone
> > >>> >> >> >> have any thoughts on what sort of additional index schemes
> we
> > >>> should
> > >>> >> want
> > >>> >> >> >> to infer (solr seems an obvious one, any others?).
> > >>> >> >> >>>>
> > >>> >> >> >>>> Thoughts on a well typed, schemaed and easily indexed
> > >>> postcard
> > >>> >> please
> > >>> >> >> > :)
> > >>> >> >> >>>>
> > >>> >> >> >>>> Simon
> > >>> >> >> >>
> > >>> >>
> > >>> >> --
> > >>> >
> > >>> > Jon
> > >>> >
> > >>> > Sent from my mobile device
> > >>>
> > >>> -------------------
> > >>> Thank you,
> > >>>
> > >>> James Sirota
> > >>> PPMC- Apache Metron (Incubating)
> > >>> jsirota AT apache DOT org
> > >>>
> > >>
> > >>
> > >
> >
> >
>
-- 

Jon

Sent from my mobile device
