Before worrying about how to ingest this 'noisy' data, I would want to better understand the root cause. If you cannot even get a valid date format, are you sure the data can be trusted?
Rather than bending over backwards to try to ingest it, I would first make sure the telemetry is not totally bogus to begin with. Maybe it is better that the data is dropped in cases like this. IMHO, that is how I would tackle a problem like this. Not all data can be trusted.

On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <[email protected]> wrote:

> Are you sure? The syslog_host name is way more complicated than something that can be a coincidence. I need to double-check with one of the security device experts, but I thought it was some kind of noise.
>
> Yes, we do have more use cases that seem to be corrupted, for example duplicate IP addresses or a corrupted date format. Please have a look at the following message; at least I am sure the date format is corrupted in this one:
>
> <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2* *y.y.y.y/p2*
>
> Cheers,
> Ali
>
> On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <[email protected]> wrote:
>
> > In that instance, you're looking at valid syslog which should be parsed as such. The repeated host is not really a host in syslog terms; it's an application name header which happens to be the same. This is definitely a parser bug which should be handled, especially since the header is perfectly RFC compliant.
> >
> > Do you have any other such cases? My view is that parsers should be written with more care in any case, and should extract all the fields they can from malformed logs rather than throwing exceptions, but that's more about the way we write parsers than having some kind of pre-clean.
> >
> > Simon
> >
> > Sent from my iPad
> >
> > On 27 Apr 2017, at 08:04, Ali Nazemian <[email protected]> wrote:
> >
> > > I do agree there is a fair amount of overhead in using another bolt for this purpose. I am not pointing to a particular implementation; there might be a way to segregate the two extension points without adding overhead, but I haven't thought about it yet. However, the main issue is that sometimes the type of noise is something that generates an exception on the parsing side. For example, have a look at the following log:
> > >
> > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0 (ryanmar)
> > >
> > > Clearly the duplicate syslog_host throws an exception during parsing, so how are we going to deal with that in a post-parse transformation? It cannot pass parsing. This is only a single example of the cases that might affect production data, unless Stellar transformation is something that can be done pre-parse and on the entire message.
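As an illustration of the kind of pre-parse cleanup being discussed, the sketch below collapses the duplicated hostname token with a regex before the line ever reaches the parser. The class and method names are hypothetical and not part of Metron; the pattern is tuned only to the sample ASA line above.

    import java.util.regex.Pattern;

    /** Hypothetical pre-parse cleaner for the duplicated-hostname case above. */
    public class SyslogPreCleaner {
      // Matches "<PRI>MMM dd HH:mm:ss host host " and keeps a single host token.
      private static final Pattern DUPLICATE_HOST = Pattern.compile(
          "^(<\\d+>\\S+ +\\d+ +\\d{2}:\\d{2}:\\d{2} +)(\\S+) +\\2 +");

      public static String collapseDuplicateHost(String rawLine) {
        return DUPLICATE_HOST.matcher(rawLine).replaceFirst("$1$2 ");
      }
    }

Feeding the ASA example through collapseDuplicateHost yields a line with a single hostname token, which a standard syslog parser can then handle.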
> > >
> > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <[email protected]> wrote:
> > >
> > > > Ali,
> > > >
> > > > It sounds very much like what you're talking about when you say normalization, as I would understand it, is the process fulfilled by Stellar field transformation in the parser config. Agreed that some of these will be general, based on the common Metron standard schema, but others will be organisation specific (custom fields overloaded with different meanings in CEF, for example). That is very much one of the reasons we have the Stellar transformation step. I don't think it should be moved to a separate bolt, to be honest, because that comes with a fair amount of overhead; logically it lives in the parser config rather than the parser, so it seems to serve this purpose in the post-parse transform, no?
> > > >
> > > > Simon
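For readers who have not used the mechanism Simon is referring to, the sensor parser configuration can attach Stellar field transformations that run after the parser has produced fields. A minimal sketch of the shape of that config follows; the parser class name, field, and TRIM expression are only illustrative, not a recommendation for the ASA sensor.

    {
      "parserClassName": "org.apache.metron.parsers.asa.BasicAsaParser",
      "sensorTopic": "asa",
      "fieldTransformations": [
        {
          "transformation": "STELLAR",
          "output": ["syslog_host"],
          "config": {
            "syslog_host": "TRIM(syslog_host)"
          }
        }
      ]
    }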
> > > >
> > > > On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]> wrote:
> > > >
> > > > > Hi Simon,
> > > > >
> > > > > The reason I am asking for a specific normalisation step is that this normalisation is not a general use case which can be reused by other users; it is completely bound to our application. The way we have fixed it for now is to add a normalisation step to the parser and clean the incoming data so the parser step can work on it, but I don't like it. There is no point in creating a parser that can handle all of the possible noise that can exist in production data. Even if it were possible to predict every kind of noise, there is no point in the Metron community focusing on building a general-purpose parser for a specific device when that time could be spent on developing a cool feature. And even if it were possible to predict the noise, and acceptable for the community to spend their time on that kind of parser, why would every Metron user need that extra normalisation? A user's data might already be clean at the first step, and then the extra step only decreases total throughput without any benefit for that specific user.
> > > > >
> > > > > Imagine there is an additional bolt for normalisation, and a mechanism to customise the normalisation without changing the general parser for a specific device. We could have a general parser as a common parser for that device and leave the normalisation development to users. However, it is very important to keep the normalisation step as fast as possible.
> > > > >
> > > > > Cheers,
> > > > > Ali
> > > > >
> > > > > On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <[email protected]> wrote:
> > > > >
> > > > > > Yeah, we definitely don't want to rewrite parsing in Stellar. I would expect the job of the parser, however, to handle structural issues. In my mind, parsing is about transforming structures into fields, and the role of the field transformations is to transform values. There's obvious overlap there, wherein parsers may do some normalizations/transformations (e.g. look at how Grok handles timestamps), but it almost always gets us into trouble when parsers do even moderately complex value transformations.
> > > > > >
> > > > > > As I type this, though, I think I see your point. What you really want is to chain parsers: have a pre-parser bring you 80% of the way there and hammer out all the structural issues, so that you can use a more generic parser further down the chain. I have often thought that maybe we should expose parsers as Stellar functions which take raw data and emit whole messages. This would allow us to compose parsers, so imagine the above example where you've written a Stellar function to normalize the input and you're then passing it to a CSV parser: you could run "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a parser.
> > > > > >
> > > > > > As for speed, the Stellar expression would get compiled into a Java object, so it shouldn't add appreciable overhead since we would no longer lex and parse for every message.
> > > > > >
> > > > > > Is this kinda how you were seeing it?
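To make the chaining idea concrete, here is a small Java sketch of composing a normalizer with a parser as plain functions. It is a conceptual illustration of the composition Casey describes, not Metron's Stellar or parser API; every name in it is hypothetical.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    /** Conceptual sketch of parser chaining; none of these types exist in Metron. */
    public class ParserChaining {
      public static void main(String[] args) {
        // Hypothetical pre-parser: fix structural noise, e.g. a duplicated host token.
        Function<String, String> normalize = raw -> raw.replaceAll("(\\S+) \\1 ", "$1 ");

        // Hypothetical generic parser: turn the cleaned line into a field map.
        Function<String, Map<String, Object>> parse = line -> {
          Map<String, Object> fields = new HashMap<>();
          fields.put("original_string", line);
          fields.put("token_count", line.split("\\s+").length);
          return fields;
        };

        // The composition plays the role of CSV_PARSE(ALI_NORMALIZE(message)) above.
        Function<String, Map<String, Object>> chained = normalize.andThen(parse);
        System.out.println(chained.apply(
            "<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP connection ..."));
      }
    }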
> > > > > >
> > > > > > On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <[email protected]> wrote:
> > > > > >
> > > > > > > The challenge there, I suspect, is going to be that you essentially end up with the actual parser doing very little of value, and then effectively trying to write a parser in Stellar against a few broad strings, which would likely give you all sorts of performance problems.
> > > > > > >
> > > > > > > One solution is to write a very defensive and flexible parser, but that would tend to be time consuming.
> > > > > > >
> > > > > > > There is also something to be said for doing some basic transformation before the parser's Kafka topic, in something like NiFi, but again, performance can be an issue there.
> > > > > > >
> > > > > > > If the noise is about broken structure, for example, maybe a simple pre-process step as part of your parser would make sense, e.g. stripping syslog headers, converting character sets, or removing very broken bits as part of the parse method.
> > > > > > >
> > > > > > > In terms of normalisation post-parse, I agree that is 100% a job for Stellar and the fieldTransformations capability. Something I would like to see would be a means to use that transformation step to map to a well-known (though loosely enforced) schema provided by a governance framework, but that is a much bigger topic of conversation.
> > > > > > >
> > > > > > > Note, of course, that not everything has to be parsed just because it's in the message. A relatively loose-fitting parser which pulls out the relevant data for the use case would be fine, and likely a lot more tolerant of noise than something that felt the need to extract every field. We do, after all, store the original_string for you if you really, absolutely have to have everything, so a more schema-on-read philosophy certainly applies and will likely side-step a lot of your issues.
> > > > > > >
> > > > > > > Simon
> > > > > > >
> > > > > > > On 26 Apr 2017, at 14:37, Casey Stella <[email protected]> wrote:
> > > > > > >
> > > > > > > > Ok, that's another story. Hmmm, we don't generally pre-parse because we try not to assume any particular format there (i.e. it could be strings, could be byte arrays). Maybe the right answer is to pass the raw, non-normalized data (best-effort type of thing) through the parser and do the normalization post-parse... or is there a problem with that?
> > > > > > > >
> > > > > > > > On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Casey,
> > > > > > > > >
> > > > > > > > > It is actually a pre-parse process, not a post-parse one. These types of noise affect, for example, the position of an attribute and give us a parsing exception. The timestamp example was not a good one, because that is actually a post-parse exception.
> > > > > > > > >
> > > > > > > > > On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > So, further transformation post-parse was one of the motivating reasons for Stellar (to do that transformation post-parse). Is there a capability it is lacking that we can add to fit your use case?
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > I've created a Jira ticket regarding this feature:
> > > > > > > > > > >
> > > > > > > > > > > https://issues.apache.org/jira/browse/METRON-893
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Currently, we are using plain regex in the Java source code to handle those situations. However, it would be nice to have a separate bolt and deal with them separately. Yeah, I can create a Jira issue regarding that. The main reason I am asking for such a feature is that the lack of it makes the process of creating a parser for the community a little painful for us: we need to maintain two different versions, one for the community and another for the internal use case. Clearly, noise is an inevitable part of real-world use cases.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Ali
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Are you doing this cleansing all in the parser, or are you using any Stellar to do it? Can you create a Jira?
> > > > > > > > > > > > >
> > > > > > > > > > > > > On April 26, 2017 at 08:59:16, Ali Nazemian ([email protected]) wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We are facing certain use cases in Metron production that happen to be related to noisy streams, for example a wrong timestamp, a duplicate hostname/IP address, etc. To deal with the normalization we have added an additional step to the corresponding parsers to do the data cleaning. Clearly, parsing is a standard factor which is mostly related to the device that is generating the data and can be reused for the same type of device everywhere, but normalization is very production dependent, and there is no point in mixing normalization with parsing.
> > > > > > > > > > > > > > It would be nice to have a separate bolt in the parsing topology dedicated to the production-related cleaning process. In that case, everybody could easily contribute additional parsers to the Metron community without worrying about mixing parsers with the data cleaning process.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ali
>
> --
> A.Nazemian
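For reference, here is a minimal sketch of what the proposed normalisation bolt could look like as a plain Storm bolt sitting between the Kafka spout and the parser bolt. Metron does not ship such a bolt; the class name, the assumption that the spout emits the raw bytes in a field called "value", and the cleaning rule are all illustrative.

    import java.nio.charset.StandardCharsets;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    /** Hypothetical pre-parse normalisation bolt; not part of Metron. */
    public class NormalizationBolt extends BaseBasicBolt {

      @Override
      public void execute(Tuple input, BasicOutputCollector collector) {
        // Assumes the upstream spout emits the raw message bytes in a field named "value".
        byte[] raw = input.getBinaryByField("value");
        String line = new String(raw, StandardCharsets.UTF_8);

        // Production-specific cleaning, e.g. collapsing a duplicated hostname token.
        String cleaned = line.replaceAll("(\\S+) \\1 ", "$1 ");

        // Hand the cleaned bytes to the downstream parser bolt, otherwise unchanged.
        collector.emit(new Values(cleaned.getBytes(StandardCharsets.UTF_8)));
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
      }
    }

The same cleaning could equally live in a NiFi processor ahead of the Kafka topic or in the parser's own pre-process step, as Simon suggests earlier in the thread; the bolt is simply the variant Ali proposes.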
