Re: Normalization topology or separate normalization bolt for parsing topology

Nick Allen Tue, 02 May 2017 10:03:53 -0700

Yes, and currently that normalization step is the Parsers.

I am not saying the message has to be entirely clear and well-defined.  But
there is a minimum set of expectations that you must have of any data that
you're ingesting.  Once it meets that "minimum set", the parser should be
able to ingest and normalize the message.  Any oddities beyond that
"minimum set" can be handled with Stellar either post-Parsing or in
Enrichment.


It is, of course, a judgement call as to what that minimum set is for you.
You would just need a Parser that matches your definition of "minimum set".
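
As a rough illustration of a Parser built around a "minimum set" (a Python
sketch, not Metron's parser API; the regex, the field names and `parse_asa`
are all hypothetical), the ASA samples quoted below only need a priority and
an %ASA tag to get through.  The header is captured loosely, so a missing day
or a repeated hostname does not cause a failure:

```python
import re
from typing import Optional

# Hypothetical "minimum set" for a Cisco ASA syslog line: a priority,
# an %ASA-<severity>-<id> tag, and the message text after it.
# Field names are illustrative only, not Metron's actual schema.
ASA_LINE = re.compile(
    r"^<(?P<priority>\d{1,3})>"        # syslog priority, e.g. <166>
    r"(?P<header>.*?)"                 # timestamp/host header, kept loosely
    r"\s*:?\s*%ASA-(?P<severity>\d)-(?P<message_id>\d+):\s*"
    r"(?P<message>.*)$"
)


def parse_asa(line: str) -> Optional[dict]:
    """Return the minimum set of fields, or None if the line lacks them."""
    match = ASA_LINE.match(line.strip())
    if match is None:
        # Below the minimum set: the caller can drop the line or
        # route it to an error queue.
        return None
    fields = match.groupdict()
    fields["header"] = fields["header"].strip()
    fields["original_string"] = line   # keep the raw line for later use
    return fields
```

Anything odd left in `header` could then be cleaned up post-parse, and a line
that comes back as None would be dropped or sent to an error queue.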

My main point here is that I am not seeing a need to re-architect
anything.  I think we have the right tools, IMHO.
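
If a site really does need a structural pre-clean, it can sit in front of the
parse call inside the same Parser, which is close to the
CSV_PARSE(ALI_NORMALIZE(message)) idea further down the thread.  Another
hypothetical Python sketch; `collapse_repeated_tokens`, `strict_parse` and
`compose` are made-up names, and the repeated-token rule is only an example of
a site-specific fix, not a general recommendation:

```python
import re
from typing import Callable, Dict

Normalizer = Callable[[str], str]
Parser = Callable[[str], Dict[str, str]]

# A deliberately strict, generic pattern: exactly one host token is allowed
# between the timestamp and the %TAG.
STRICT_LINE = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<timestamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<tag>%\S+): (?P<message>.*)$"
)


def strict_parse(raw: str) -> Dict[str, str]:
    """Generic parser that raises on any structural surprise."""
    match = STRICT_LINE.match(raw)
    if match is None:
        raise ValueError("unparseable line")
    return match.groupdict()


def collapse_repeated_tokens(raw: str) -> str:
    """Site-specific pre-clean: drop a token that repeats the one before it."""
    cleaned = []
    for token in raw.split(" "):
        if cleaned and token != "" and token == cleaned[-1]:
            continue
        cleaned.append(token)
    return " ".join(cleaned)


def compose(normalize: Normalizer, parse: Parser) -> Parser:
    """Build a parser that normalizes the raw line before parsing it."""
    return lambda raw: parse(normalize(raw))
```

With this, `compose(collapse_repeated_tokens, strict_parse)` accepts the
"hostname hostname" sample that `strict_parse` alone rejects, and the
site-specific part stays out of the generic parser.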


On Tue, May 2, 2017 at 10:33 AM, Ali Nazemian <[email protected]> wrote:

> Hi Nick,
>
> The date could be corrupted for any number of reasons, and sometimes we have
> no control over the device. Obviously, it is not a big deal if we lose a <166>
> severity message, but it could be a different situation for a <161>
> severity or an actual critical threat. However, I mentioned those
> defects as examples to point out the importance of having a normalisation
> step in the Metron processing chain.
>
> I still think there is no guarantee of getting an entirely clear and
> well-defined message in real-world use cases. If we recognise this
> situation as a problem, then finding a high-performance and flexible
> solution is not very hard.
>
> Cheers,
> Ali
>
> On Tue, May 2, 2017 at 11:24 PM, Nick Allen <[email protected]> wrote:
>
> > Before worrying about how to ingest this 'noisy' data, I would want to
> > better understand root cause.  If you cannot even get a valid date format,
> > are you sure the data can be trusted?
> >
> > Rather than bending over backwards to try to ingest it, I would first make
> > sure the telemetry is not totally bogus to begin with.  Maybe it is better
> > that the data is dropped in cases like this.
> >
> > IMHO, that is how I would tackle a problem like this.  Not all data can be
> > trusted.
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <[email protected]>
> > wrote:
> >
> > > Are you sure? The syslog_host name is way more complicated than something
> > > that can be a coincidence. I need to double check with one of the security
> > > device experts, but I thought it was some kind of noise.
> > >
> > > Yes, we do have more use cases that seem to be corrupted. For example,
> > > having duplicate IP addresses or a corrupted date format. Please have a look
> > > at the following message. At least I am sure the date format is corrupted
> > > in this one.
> > >
> > > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection
> > > 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2*
> > > *y.y.y.y/p2*
> > >
> > > Cheers,
> > > Ali
> > >
> > > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > > [email protected]> wrote:
> > >
> > > > In that instance, you're looking at valid syslog which should be parsed as
> > > > such. The repeat host is not really a host in syslog terms, it's an
> > > > application name header which happens to be the same. This is definitely a
> > > > parser bug which should be handled, esp. since the header is perfectly RFC
> > > > compliant.
> > > >
> > > > Do you have any other such cases? My view is that parsers should be
> > > > written to cope with more or less any case, so should extract all the
> > > > fields they can from malformed logs, rather than throwing exceptions, but
> > > > that's more about the way we write parsers than having some kind of
> > > > pre-clean.
> > > >
> > > > Simon
> > > >
> > > > Sent from my iPad
> > > >
> > > > > On 27 Apr 2017, at 08:04, Ali Nazemian <[email protected]>
> > wrote:
> > > > >
> > > > > I do agree there is a fair amount of overhead in using another bolt for
> > > > > this purpose. I am not prescribing the implementation. There might be a
> > > > > way to segregate the two extension points without adding
> > > > > overhead; I haven't thought about it yet. However, the main issue is that
> > > > > sometimes the noise is something that generates an exception on the
> > > > > parsing side. For example, have a look at the following log:
> > > > >
> > > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> > > > > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > > > > (ryanmar)
> > > > >
> > > > > Clearly the duplicate syslog_host throws an exception on parsing, so how
> > > > > are we going to deal with that in a post-parse transformation? It cannot
> > > > > get past the parsing. This is only a single example of the cases that might
> > > > > affect the production data. Unless the Stellar transformation is something
> > > > > that can be done pre-parse and on the entire message.
> > > > >
> > > > >
> > > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > > [email protected]> wrote:
> > > > >
> > > > >> Ali,
> > > > >>
> > > > > >> Sounds very much like what you’re talking about when you say
> > > > > >> normalization, and what I would understand it as, is the process fulfilled
> > > > > >> by Stellar field transformation in the parser config. Agreed that some of
> > > > > >> these will be general, based on a common Metron standard schema, but others
> > > > > >> will be organisation specific (custom fields overloaded with different
> > > > > >> meanings in CEF, for example). These are very much one of the
> > > > > >> reasons we have the Stellar transformation step. I don’t think that should
> > > > > >> be moved to a separate bolt to be honest, because that comes with a fair
> > > > > >> amount of overhead, but logically it is in the parser config rather than
> > > > > >> the parser, so seems to serve this purpose in the post-parse transform, no?
> > > > >>
> > > > >> Simon
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]>
> > > wrote:
> > > > >>>
> > > > >>> Hi Simon,
> > > > >>>
> > > > > >>> The reason I am asking for a specific normalisation step is
> > > > > >>> that normalisation is not a general use case which can be used by other
> > > > > >>> users. It is completely bound to our application. The way we have fixed
> > > > > >>> it, for now, is to add a normalisation step to the parser and clean the
> > > > > >>> incoming data so the parser step can work on it, but I don't like it.
> > > > > >>> There is no point in creating a parser that can handle all of the possible
> > > > > >>> noise that can exist in production data. Even if it is possible to
> > > > > >>> predict every kind of noise in production data, there is no point in the
> > > > > >>> Metron community focusing on building a general purpose parser for a
> > > > > >>> specific device when they could spend that time on developing a cool
> > > > > >>> feature. Even if it is possible to predict the noise and it is acceptable
> > > > > >>> for the community to spend their time on creating that kind of parser, why
> > > > > >>> would every Metron user need that extra normalisation? A user's data might
> > > > > >>> be clean to begin with, and then the extra step only decreases the total
> > > > > >>> throughput without being of any use to that specific user.
> > > > > >>>
> > > > > >>> Imagine there is an additional bolt for normalisation and there is a
> > > > > >>> mechanism to customise the normalisation without changing the general
> > > > > >>> parser for a specific device. We can have a general parser as a common
> > > > > >>> parser for that device and leave the normalisation development to users.
> > > > > >>> However, it is very important to provide the normalisation step as fast as
> > > > > >>> possible.
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Ali
> > > > >>>
> > > > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <
> [email protected]
> > >
> > > > >> wrote:
> > > > >>>
> > > > > >>>> Yeah, we definitely don't want to rewrite parsing in Stellar.  I would
> > > > > >>>> expect the job of the parser, however, to handle structural issues.  In my
> > > > > >>>> mind, parsing is about transforming structures into fields and the role of
> > > > > >>>> the field transformations is to transform values.  There's obvious overlap
> > > > > >>>> there wherein parsers may do some normalizations/transformations (e.g. look
> > > > > >>>> at how grok handles timestamps), but it almost always gets us into trouble
> > > > > >>>> when parsers do even moderately complex value transformations.
> > > > > >>>>
> > > > > >>>> As I type this, though, I think I see your point.  What you really want is
> > > > > >>>> to chain parsers, have a pre-parser to bring you 80% of the way there and
> > > > > >>>> hammer out all the structural issues so you might be able to use a more
> > > > > >>>> generic parser down the chain.  I have often thought that maybe we should
> > > > > >>>> expose parsers as Stellar functions which take raw data and emit whole
> > > > > >>>> messages.  This would allow us to compose parsers, so imagine the above
> > > > > >>>> example where you've written a Stellar function to normalize the input and
> > > > > >>>> you're then passing it to a CSV parser, you could run
> > > > > >>>> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a
> > > > > >>>> parser.
> > > > > >>>>
> > > > > >>>> As for speed, the Stellar expression would get compiled into a Java object,
> > > > > >>>> so it shouldn't be appreciable overhead since we no longer lex and parse
> > > > > >>>> for every message.
> > > > >>>>
> > > > >>>> Is this kinda how you were seeing it?
> > > > >>>>
> > > > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > > > >>>> [email protected]> wrote:
> > > > >>>>
> > > > > >>>>> The challenge there I suspect is going to be that you essentially end up
> > > > > >>>>> with the actual parser doing very little of value, and then effectively
> > > > > >>>>> trying to write a parser in Stellar against a few broad strings, which
> > > > > >>>>> would likely give you all sorts of performance problems.
> > > > > >>>>>
> > > > > >>>>> One solution is to write a very defensive and flexible parser, but that
> > > > > >>>>> would tend to be time consuming.
> > > > > >>>>>
> > > > > >>>>> There is also something to be said for doing some basic transformation
> > > > > >>>>> before the parser Kafka topic in something like NiFi, but again,
> > > > > >>>>> performance can be an issue there.
> > > > > >>>>>
> > > > > >>>>> If the noise is about broken structure for example, maybe a simple
> > > > > >>>>> pre-process step as part of your parser would make sense, e.g. stripping
> > > > > >>>>> syslog headers, or character set conversion, removing very broken bits as
> > > > > >>>>> part of the parse method.
> > > > > >>>>>
> > > > > >>>>> In terms of normalisation post-parse, I agree, that's 100% a job for
> > > > > >>>>> Stellar, and the fieldTransformations capability. Something I would like to
> > > > > >>>>> see would be a means to use that transformation step to map to a well known
> > > > > >>>>> (though loosely enforced) schema provided by a governance framework, but
> > > > > >>>>> that is a much bigger topic of conversation.
> > > > > >>>>>
> > > > > >>>>> Note of course that not everything has to be parsed just because it’s in
> > > > > >>>>> the message. A relatively loose fitting parser which pulls out the relevant
> > > > > >>>>> data for the use case would be fine, and likely a lot more tolerant of
> > > > > >>>>> noise than something that felt the need for every field. We do after all
> > > > > >>>>> store the original_string for you if you really absolutely have to have
> > > > > >>>>> everything, so a more schema-on-read philosophy certainly applies and will
> > > > > >>>>> likely side-step a lot of your issues.
> > > > >>>>>
> > > > >>>>> Simon
> > > > >>>>>
> > > > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <[email protected]>
> > > wrote:
> > > > >>>>>>
> > > > > >>>>>> Ok, that's another story.  hmmmm, we don't generally pre-parse because we
> > > > > >>>>>> try to not assume any particular format there (i.e. it could be strings,
> > > > > >>>>>> could be byte arrays).  Maybe the right answer is to pass the raw,
> > > > > >>>>>> non-normalized data (best effort type of thing) through the parser and do
> > > > > >>>>>> the normalization post-parse... or is there a problem with that?
> > > > >>>>>>
> > > > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <
> > > > [email protected]>
> > > > >>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hi Casey,
> > > > >>>>>>>
> > > > > >>>>>>> It is actually a pre-parse process, not a post-parse one. This type of
> > > > > >>>>>>> noise affects the position of an attribute, for example, and gives us a
> > > > > >>>>>>> parsing exception. The timestamp example was not a good one because that is
> > > > > >>>>>>> actually a post-parse exception.
> > > > >>>>>>>
> > > > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <
> > > [email protected]
> > > > >
> > > > >>>>> wrote:
> > > > >>>>>>>
> > > > > >>>>>>>> So, further transformation post-parse was one of the motivating reasons for
> > > > > >>>>>>>> Stellar (to do that transformation post-parse).  Is there a capability that
> > > > > >>>>>>>> it's lacking that we can add to fit your usecase?
> > > > >>>>>>>>
> > > > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <
> > > > >> [email protected]
> > > > >>>>>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > > > >>>>>>>>>
> > > > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <
> > > > >>>> [email protected]
> > > > >>>>>>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > > >>>>>>>>>> Currently, we are using plain regex in the Java source code to handle
> > > > > >>>>>>>>>> those situations. However, it would be nice to have a separate bolt and
> > > > > >>>>>>>>>> deal with them separately. Yeah, I can create a Jira issue regarding that.
> > > > > >>>>>>>>>> The main reason I am asking for such a feature is that the lack of
> > > > > >>>>>>>>>> it makes the process of contributing a parser to the community
> > > > > >>>>>>>>>> a little painful for us. We need to maintain two different versions, one
> > > > > >>>>>>>>>> for the community and another for the internal use case. Clearly, noise is
> > > > > >>>>>>>>>> an inevitable part of real world use cases.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Cheers,
> > > > >>>>>>>>>> Ali
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <
> > > > >>>>>>> [email protected]
> > > > >>>>>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Are you doing this cleansing all in the parser or are you using any
> > > > > >>>>>>>>>>> Stellar to do it?
> > > > > >>>>>>>>>>> Can you create a jira?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (
> > > > >>>> [email protected])
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> We are facing certain use cases in Metron production that happen to be
> > > > > >>>>>>>>>>> related to noisy streams. For example, a wrong timestamp, duplicate
> > > > > >>>>>>>>>>> hostname/IP address, etc. To deal with the normalization we have added an
> > > > > >>>>>>>>>>> additional step to the corresponding parsers to do the data cleaning.
> > > > > >>>>>>>>>>> Clearly, parsing is a standard factor which is mostly related to the device
> > > > > >>>>>>>>>>> that is generating the data and can be used for the same type of device
> > > > > >>>>>>>>>>> everywhere, but normalization is very production dependent and there is no
> > > > > >>>>>>>>>>> point in mixing normalization with parsing. It would be nice to have a
> > > > > >>>>>>>>>>> separate bolt in the parsing topologies dedicated to the production
> > > > > >>>>>>>>>>> related cleaning process. In that case, everybody can easily contribute to
> > > > > >>>>>>>>>>> the Metron community with additional parsers without being worried about
> > > > > >>>>>>>>>>> mixing parsers and the data cleaning process.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Regards,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Ali
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> --
> > > > >>>>>>>>>> A.Nazemian
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> --
> > > > >>>>>>>>> A.Nazemian
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> --
> > > > >>>>>>> A.Nazemian
> > > > >>>>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> A.Nazemian
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > A.Nazemian
> > > >
> > >
> > >
> > >
> > > --
> > > A.Nazemian
> > >
> >
>
>
>
> --
> A.Nazemian
>
