Re: Normalization topology or separate normalization bolt for parsing topology

Ali Nazemian Tue, 02 May 2017 07:34:45 -0700

Hi Nick,

The date could be corrupted for any number of reasons, and sometimes we have
no control over the device. Obviously, it is not a big deal if we lose a <166>
severity message, but it could be a different situation for a <161>
severity message or an actual critical threat. However, I mentioned those
defects only as examples to point out the importance of having a normalisation
step in the Metron processing chain.


I still think there is no guarantee of getting an entirely clean and
well-defined message in real-world use cases. If we recognise this
situation as a problem, then finding a high-performance and flexible
solution is not very hard.
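
To make the idea concrete, a pre-parse normalisation hook could look roughly
like the sketch below. It is only an illustration in Python, not Metron code:
the function name normalise and the two regex rules are hypothetical, and it
assumes that a repeated token right after the timestamp is unwanted noise,
which (as Simon pointed out) may not hold for every device. It collapses a
duplicated hostname and flags a timestamp that has lost its day of month, so
that a generic parser downstream never sees the broken structure.

```python
import re
from typing import List, Tuple

# Hypothetical clean-up rules for ASA-style syslog lines (illustrative only).
# Header = priority, then a timestamp whose day-of-month may be missing.
_HEADER = re.compile(
    r"^(?P<pri><\d{1,3}>)"
    r"(?P<ts>[A-Z][a-z]{2}\s+(?:\d{1,2}\s+)?\d{2}:\d{2}:\d{2})\s+"
    r"(?P<rest>.*)$"
)
# A complete timestamp has month, day of month and time.
_FULL_TS = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}$")


def normalise(raw: str) -> Tuple[str, List[str]]:
    """Return (cleaned_message, notes) for one raw syslog line."""
    match = _HEADER.match(raw)
    if match is None:
        # Leave anything we do not recognise untouched.
        return raw, ["unrecognised header"]

    notes: List[str] = []
    rest = match.group("rest")

    # Assumption: two identical leading tokens are a duplicated hostname.
    tokens = rest.split(" ", 2)
    if len(tokens) >= 2 and tokens[0] == tokens[1]:
        rest = " ".join([tokens[0]] + tokens[2:])
        notes.append("duplicate hostname removed")

    # Only flag the broken date; guessing the missing day would invent data.
    if not _FULL_TS.match(match.group("ts")):
        notes.append("timestamp is missing the day of month")

    return f"{match.group('pri')}{match.group('ts')} {rest}", notes


if __name__ == "__main__":
    sample = ("<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: "
              "Teardown ICMP connection for faddr x.x.x.x/0")
    print(normalise(sample))
```

The rules live outside the parser, so each site could keep its own set
without touching the common parser for that device.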

Cheers,
Ali

On Tue, May 2, 2017 at 11:24 PM, Nick Allen <[email protected]> wrote:

> Before worrying about how to ingest this 'noisy' data, I would want to
> better understand root cause.  If you cannot even get a valid date format,
> are you sure the data can be trusted?
>
> Rather than bending over backwards to try to ingest it, I would first make
> sure the telemetry is not totally bogus to begin with.  Maybe it is better
> that the data is dropped in cases like this.
>
> IMHO, that is how I would tackle a problem like this.  Not all data can be
> trusted.
>
> On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <[email protected]>
> wrote:
>
> > Are you sure? The syslog_host name is far too complicated to be a
> > coincidence. I need to double-check with one of the security
> > device experts, but I thought it was some kind of noise.
> >
> > Yes, we do have more cases that seem to be corrupted, for example
> > duplicate IP addresses or a corrupted date format. Please have a look
> > at the following message. At least I am sure the date format is corrupted
> > in this one.
> >
> > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection
> > 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2*
> > *y.y.y.y/p2*
> >
> > Cheers,
> > Ali
> >
> > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > [email protected]> wrote:
> >
> > > In that instance, you're looking at valid syslog which should be parsed
> > > as such. The repeated host is not really a host in syslog terms, it's an
> > > application name header which happens to be the same. This is definitely
> > > a parser bug which should be handled, esp. since the header is perfectly
> > > RFC compliant.
> > >
> > > Do you have any other such cases? My view is that parsers should be
> > > written to cope with more cases, so should extract all the fields they
> > > can from malformed logs, rather than throwing exceptions, but that's more
> > > about the way we write parsers than having some kind of pre-clean.
> > >
> > > Simon
> > >
> > > Sent from my iPad
> > >
> > > > On 27 Apr 2017, at 08:04, Ali Nazemian <[email protected]> wrote:
> > > >
> > > > I do agree there is a fair amount of overhead in using another bolt
> > > > for this purpose. I am not arguing for a particular implementation.
> > > > There might be a way to segregate the two extension points without
> > > > adding overhead; I haven't thought about it yet. However, the main
> > > > issue is that sometimes the noise is something that generates an
> > > > exception on the parsing side. For example, have a look at the
> > > > following log:
> > > >
> > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> > > > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > > > (ryanmar)
> > > >
> > > > Clearly the duplicate syslog_host throws an exception on parsing, so how
> > > > are we going to deal with that in a post-parse transformation? It cannot
> > > > get past the parsing. This is only a single example of the cases that
> > > > might affect the production data, unless Stellar transformation is
> > > > something that can be done pre-parse and on the entire message.
> > > >
> > > >
> > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > [email protected]> wrote:
> > > >
> > > >> Ali,
> > > >>
> > > >> Sounds very much like what you’re talking about when you say
> > > >> normalization, and what I would understand it as, is the process
> > > >> fulfilled by Stellar field transformation in the parser config. Agreed
> > > >> that some of these will be general, based on a common Metron standard
> > > >> schema, but others will be organisation specific (custom fields
> > > >> overloaded with different meanings in CEF, for example). These are very
> > > >> much one of the reasons we have the Stellar transformation step. I don’t
> > > >> think that should be moved to a separate bolt to be honest, because that
> > > >> comes with a fair amount of overhead, but logically it is in the parser
> > > >> config rather than the parser, so seems to serve this purpose in the
> > > >> post-parse transform, no?
> > > >>
> > > >> Simon
> > > >>
> > > >>
> > > >>
> > > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]> wrote:
> > > >>>
> > > >>> Hi Simon,
> > > >>>
> > > >>> The reason I am asking for a specific normalisation step is that
> > > >>> normalisation is not a general use case which can be reused by other
> > > >>> users. It is completely bound to our application. The way we have fixed
> > > >>> it, for now, is to add a normalisation step to the parser and clean the
> > > >>> incoming data so the parser step can work on it, but I don't like it.
> > > >>> There is no point in creating a parser that can handle all of the
> > > >>> possible noise that can exist in production data. Even if it is
> > > >>> possible to predict every kind of noise in production data, there is no
> > > >>> point in the Metron community focusing on building a general-purpose
> > > >>> parser for a specific device when they could spend that time on
> > > >>> developing a cool feature. Even if it is possible to predict the noise
> > > >>> and it is acceptable for the community to spend their time on creating
> > > >>> that kind of parser, why would every Metron user need that extra
> > > >>> normalisation? A user's data might be clean from the start, and then it
> > > >>> obviously only decreases the total throughput without any benefit for
> > > >>> that specific user.
> > > >>>
> > > >>> Imagine there is an additional bolt for normalisation and there is a
> > > >>> mechanism to customise the normalisation without changing the general
> > > >>> parser for a specific device. We can have a general parser as a common
> > > >>> parser for that device and leave the normalisation development to
> > > >>> users. However, it is very important to provide the normalisation step
> > > >>> as fast as possible.
> > > >>>
> > > >>> Cheers,
> > > >>> Ali
> > > >>>
> > > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>>> Yeah, we definitely don't want to rewrite parsing in Stellar.  I would
> > > >>>> expect the job of the parser, however, to handle structural issues.  In
> > > >>>> my mind, parsing is about transforming structures into fields and the
> > > >>>> role of the field transformations is to transform values.  There's
> > > >>>> obvious overlap there wherein parsers may do some
> > > >>>> normalizations/transformations (e.g. look how grok handles timestamps),
> > > >>>> but it almost always gets us into trouble when parsers do even
> > > >>>> moderately complex value transformations.
> > > >>>>
> > > >>>> As I type this, though, I think I see your point.  What you really want
> > > >>>> is to chain parsers: have a pre-parser to bring you 80% of the way there
> > > >>>> and hammer out all the structural issues so you might be able to use a
> > > >>>> more generic parser down the chain.  I have often thought that maybe we
> > > >>>> should expose parsers as Stellar functions which take raw data and emit
> > > >>>> whole messages.  This would allow us to compose parsers, so imagine the
> > > >>>> above example where you've written a Stellar function to normalize the
> > > >>>> input and you're then passing it to a CSV parser: you could run
> > > >>>> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a
> > > >>>> parser.
> > > >>>>
> > > >>>> As for speed, the Stellar expression would get compiled into a Java
> > > >>>> object, so it shouldn't be appreciable overhead since we no longer lex
> > > >>>> and parse for every message.
> > > >>>>
> > > >>>> Is this kinda how you were seeing it?
> > > >>>>
> > > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > > >>>> [email protected]> wrote:
> > > >>>>
> > > >>>>> The challenge there I suspect is going to be that you essentially end
> > > >>>>> up with the actual parser doing very little of value, and then
> > > >>>>> effectively trying to write a parser in Stellar against a few broad
> > > >>>>> strings, which would likely give you all sorts of performance problems.
> > > >>>>>
> > > >>>>> One solution is to write a very defensive and flexible parser, but
> > > >>>>> that would tend to be time consuming.
> > > >>>>>
> > > >>>>> There is also something to be said for doing some basic transformation
> > > >>>>> before the parser's Kafka topic, in something like NiFi, but again,
> > > >>>>> performance can be an issue there.
> > > >>>>>
> > > >>>>> If the noise is about broken structure, for example, maybe a simple
> > > >>>>> pre-process step as part of your parser would make sense, e.g.
> > > >>>>> stripping syslog headers, character set conversion, or removing very
> > > >>>>> broken bits as part of the parse method.
> > > >>>>>
> > > >>>>> In terms of normalisation post-parse, I agree, that is 100% a job for
> > > >>>>> Stellar and the fieldTransformations capability. Something I would
> > > >>>>> like to see would be a means to use that transformation step to map to
> > > >>>>> a well-known (though loosely enforced) schema provided by a governance
> > > >>>>> framework, but that is a much bigger topic of conversation.
> > > >>>>>
> > > >>>>> Note of course that not everything has to be parsed just because it’s
> > > >>>>> in the message. A relatively loose-fitting parser which pulls out the
> > > >>>>> relevant data for the use case would be fine, and likely a lot more
> > > >>>>> tolerant of noise than something that felt the need for every field.
> > > >>>>> We do after all store the original_string for you if you really
> > > >>>>> absolutely have to have everything, so a more schema-on-read
> > > >>>>> philosophy certainly applies and will likely side-step a lot of your
> > > >>>>> issues.
> > > >>>>>
> > > >>>>> Simon
> > > >>>>>
> > > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <[email protected]> wrote:
> > > >>>>>>
> > > >>>>>> Ok, that's another story.  Hmmmm, we don't generally pre-parse because
> > > >>>>>> we try to not assume any particular format there (i.e. it could be
> > > >>>>>> strings, could be byte arrays).  Maybe the right answer is to pass the
> > > >>>>>> raw, non-normalized data (best effort type of thing) through the
> > > >>>>>> parser and do the normalization post-parse... or is there a problem
> > > >>>>>> with that?
> > > >>>>>>
> > > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <[email protected]>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi Casey,
> > > >>>>>>>
> > > >>>>>>> It is actually a pre-parse process, not a post-parse one. This type
> > > >>>>>>> of noise affects the position of an attribute, for example, and
> > > >>>>>>> gives us a parsing exception. The timestamp example was not a good
> > > >>>>>>> one because that is actually a post-parse exception.
> > > >>>>>>>
> > > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <[email protected]>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> So, further transformation post-parse was one of the motivating
> > > >>>>>>>> reasons for Stellar (to do that transformation post-parse).  Is
> > > >>>>>>>> there a capability that it's lacking that we can add to fit your
> > > >>>>>>>> use case?
> > > >>>>>>>>
> > > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <[email protected]>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > > >>>>>>>>>
> > > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <[email protected]>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Currently, we are using plain regexes in the Java source code to
> > > >>>>>>>>>> handle those situations. However, it would be nice to have a
> > > >>>>>>>>>> separate bolt and deal with them separately. Yeah, I can create a
> > > >>>>>>>>>> Jira issue regarding that. The main reason I am asking for such a
> > > >>>>>>>>>> feature is that the lack of it makes the process of creating a
> > > >>>>>>>>>> parser for the community a little painful for us. We need to
> > > >>>>>>>>>> maintain two different versions, one for the community and another
> > > >>>>>>>>>> for the internal use case. Clearly, noise is an inevitable part of
> > > >>>>>>>>>> real-world use cases.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Cheers,
> > > >>>>>>>>>> Ali
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <[email protected]>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Are you doing this cleansing all in the parser or are you using
> > > >>>>>>>>>>> any Stellar to do it?
> > > >>>>>>>>>>> Can you create a Jira?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian ([email protected])
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> We are facing certain use cases in Metron production that happen
> > > >>>>>>>>>>> to be related to a noisy stream, for example a wrong timestamp,
> > > >>>>>>>>>>> a duplicate hostname/IP address, etc. To deal with the
> > > >>>>>>>>>>> normalization we have added an additional step to the
> > > >>>>>>>>>>> corresponding parsers to do the data cleaning. Clearly, parsing
> > > >>>>>>>>>>> is a standard concern which is mostly related to the device that
> > > >>>>>>>>>>> is generating the data and can be reused for the same type of
> > > >>>>>>>>>>> device everywhere, but normalization is very production
> > > >>>>>>>>>>> dependent and there is no point in mixing normalization with
> > > >>>>>>>>>>> parsing. It would be nice to have a separate bolt in the parsing
> > > >>>>>>>>>>> topologies dedicated to the production-related cleaning process.
> > > >>>>>>>>>>> In that case, everybody can easily contribute additional parsers
> > > >>>>>>>>>>> to the Metron community without being worried about mixing
> > > >>>>>>>>>>> parsers and the data cleaning process.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Regards,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Ali
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> A.Nazemian
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>> A.Nazemian
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> A.Nazemian
> > > >>>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> A.Nazemian
> > > >>
> > > >>
> > > >
> > > >
> > > > --
> > > > A.Nazemian
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian
