I do agree there is a fair amount of overhead in using another bolt for
this purpose. I am not prescribing a particular implementation; there may be
a way to segregate the two extension points without adding overhead, though
I haven't thought it through yet. The main issue, however, is that sometimes
the noise is the kind that throws an exception on the parsing side. For
example, have a look at the following log:

<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
(ryanmar)

Clearly the duplicated syslog_host throws an exception during parsing, so
how would we deal with that in a post-parse transformation? The message
never gets through the parser in the first place. This is only one example
of the kind of noise that can affect production data, unless a Stellar
transformation is something that can be applied pre-parse and to the entire
message.
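
To make it concrete, the kind of pre-parse cleanup we currently bolt onto the
parser is roughly the following. This is only an illustrative sketch; the class
name and the regex are hypothetical, not what we actually ship:

    import java.util.regex.Pattern;

    // Illustrative pre-parse cleanup: collapse a duplicated syslog hostname
    // ("... hostname hostname %ASA-..." becomes "... hostname %ASA-...")
    // before the message reaches the real parser.
    public class SyslogNoiseNormalizer {

      // Two identical whitespace-separated tokens in a row, e.g. the doubled
      // hostname in the log above.
      private static final Pattern DUPLICATE_TOKEN =
          Pattern.compile("(\\S+)\\s+\\1\\s+");

      public static String normalize(String rawMessage) {
        if (rawMessage == null) {
          return null;
        }
        // Keep only one copy of the repeated token so position-based parsing
        // downstream still lines up.
        return DUPLICATE_TOKEN.matcher(rawMessage).replaceFirst("$1 ");
      }
    }

Having this kind of cleanup live inside the parser itself is exactly what we
would like to avoid.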


On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
[email protected]> wrote:

> Ali,
>
> Sounds very much like what you're describing as normalization is the process
> fulfilled by the Stellar field transformation in the parser config. Agreed
> that some of these will be general, based on the common Metron standard
> schema, but others will be organisation specific (custom fields overloaded
> with different meanings in CEF, for example). This is very much one of the
> reasons we have the Stellar transformation step. I don't think that should be
> moved to a separate bolt, to be honest, because that comes with a fair amount
> of overhead, but logically it lives in the parser config rather than the
> parser, so it seems to serve this purpose in the post-parse transform, no?
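>
> For illustration, a minimal fieldTransformations sketch of the sort of thing
> I mean (the sensor, the field name and the Stellar expression are only an
> example):
>
>     {
>       "parserClassName": "org.apache.metron.parsers.GrokParser",
>       "sensorTopic": "asa",
>       "fieldTransformations": [
>         {
>           "transformation": "STELLAR",
>           "output": ["hostname"],
>           "config": {
>             "hostname": "TRIM(TO_LOWER(hostname))"
>           }
>         }
>       ]
>     }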
>
> Simon
>
>
>
> > On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]> wrote:
> >
> > Hi Simon,
> >
> > The reason I am asking for a specific normalisation step is that
> > normalisation is not a general use case which can be shared with other
> > users; it is completely bound to our application. The way we have fixed
> > it, for now, is to add a normalisation step to the parser that cleans the
> > incoming data so the parsing step can work on it, but I don't like it.
> > There is no point in creating a parser that can handle every possible
> > kind of noise that can exist in production data. Even if it were possible
> > to predict every kind of noise, there is no point in the Metron community
> > spending time building a general-purpose parser for a specific device when
> > they could spend that time developing a cool feature. And even if it were
> > possible to predict the noise, and the community were willing to spend
> > their time on that kind of parser, why should every Metron user need that
> > extra normalisation? A user's data might already be clean at the first
> > step, and in that case the extra step only decreases the total throughput
> > without any benefit for that user.
> >
> > Imagine there is an additional bolt for normalisation and a mechanism to
> > customise the normalisation without changing the general parser for a
> > specific device. We could have a common parser for that device and leave
> > the normalisation development to users. However, it is very important for
> > the normalisation step to be as fast as possible.
> >
> > Cheers,
> > Ali
> >
> > On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <[email protected]>
> wrote:
> >
> >> Yeah, we definitely don't want to rewrite parsing in Stellar.  I would
> >> expect the job of the parser, however, to handle structural issues.  In my
> >> mind, parsing is about transforming structures into fields, and the role
> >> of the field transformations is to transform values.  There's obvious
> >> overlap there wherein parsers may do some normalizations/transformations
> >> (i.e. look at how grok handles timestamps), but it almost always gets us
> >> into trouble when parsers do even moderately complex value transformations.
> >>
> >> As I type this, though, I think I see your point.  What you really want is
> >> to chain parsers: have a pre-parser bring you 80% of the way there and
> >> hammer out all the structural issues, so you might be able to use a more
> >> generic parser down the chain.  I have often thought that maybe we should
> >> expose parsers as Stellar functions which take raw data and emit whole
> >> messages.  This would allow us to compose parsers; so, in the above
> >> example, where you've written a Stellar function to normalize the input
> >> and you're then passing it to a CSV parser, you could run
> >> "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a
> >> parser.
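> >>
> >> For the sake of argument, a rough sketch of what ALI_NORMALIZE could look
> >> like as a Stellar function (the package, interface and annotation details
> >> shown here are assumptions and may not match the extension API exactly):
> >>
> >>     import java.util.List;
> >>     import org.apache.metron.common.dsl.Context;
> >>     import org.apache.metron.common.dsl.ParseException;
> >>     import org.apache.metron.common.dsl.Stellar;
> >>     import org.apache.metron.common.dsl.StellarFunction;
> >>
> >>     // Hypothetical ALI_NORMALIZE: strips site-specific noise from a raw
> >>     // message so a generic parser can be applied to the result.
> >>     @Stellar(namespace = "ALI",
> >>              name = "NORMALIZE",
> >>              description = "Strips site-specific noise from a raw message",
> >>              params = { "message - the raw message" },
> >>              returns = "The cleaned message")
> >>     public class AliNormalize implements StellarFunction {
> >>
> >>       @Override
> >>       public Object apply(List<Object> args, Context context)
> >>           throws ParseException {
> >>         if (args.isEmpty() || args.get(0) == null) {
> >>           return null;
> >>         }
> >>         String raw = args.get(0).toString();
> >>         // e.g. collapse the duplicated syslog hostname case
> >>         return raw.replaceFirst("(\\S+)\\s+\\1\\s+", "$1 ");
> >>       }
> >>
> >>       @Override
> >>       public void initialize(Context context) { }
> >>
> >>       @Override
> >>       public boolean isInitialized() { return true; }
> >>     }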
> >>
> >> As for speed, the Stellar expression would get compiled into a Java
> >> object, so it shouldn't add appreciable overhead since we no longer lex
> >> and parse for every message.
> >>
> >> Is this kinda how you were seeing it?
> >>
> >> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> >> [email protected]> wrote:
> >>
> >>> The challenge there, I suspect, is going to be that you essentially end
> >>> up with the actual parser doing very little of value, and then
> >>> effectively trying to write a parser in Stellar against a few broad
> >>> strings, which would likely give you all sorts of performance problems.
> >>>
> >>> One solution is to write a very defensive and flexible parser, but that
> >>> would tend to be time consuming.
> >>>
> >>> There is also something to be said for doing some basic transformation
> >>> before the parser topic in Kafka, in something like NiFi, but again,
> >>> performance can be an issue there.
> >>>
> >>> If the noise is about broken structure, for example, maybe a simple
> >>> pre-process step as part of your parser would make sense, e.g. stripping
> >>> syslog headers, converting character sets, or removing very broken bits
> >>> as part of the parse method.
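> >>>
> >>> As a sketch of that idea (purely illustrative, assuming UTF-8 input and
> >>> a standard syslog PRI header such as "<166>"):
> >>>
> >>>     import java.nio.charset.StandardCharsets;
> >>>
> >>>     // Hypothetical pre-process inside a parser's parse method: normalise
> >>>     // the character set and strip the syslog PRI header before parsing.
> >>>     public final class PreProcess {
> >>>       public static String clean(byte[] rawMessage) {
> >>>         String msg = new String(rawMessage, StandardCharsets.UTF_8);
> >>>         return msg.replaceFirst("^<\\d{1,3}>", "");
> >>>       }
> >>>     }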
> >>>
> >>> In terms of normalisation post-parse, I agree, that is 100% a job for
> >>> Stellar and the fieldTransformations capability. Something I would like
> >>> to see would be a means to use that transformation step to map to a
> >>> well-known (though loosely enforced) schema provided by a governance
> >>> framework, but that is a much bigger topic of conversation.
> >>>
> >>> Note of course that not everything has to be parsed just because it's in
> >>> the message. A relatively loose-fitting parser which pulls out the
> >>> relevant data for the use case would be fine, and likely a lot more
> >>> tolerant of noise than something that felt the need to extract every
> >>> field. We do after all store the original_string for you if you really,
> >>> absolutely have to have everything, so a more schema-on-read philosophy
> >>> certainly applies and will likely side-step a lot of your issues.
> >>>
> >>> Simon
> >>>
> >>>> On 26 Apr 2017, at 14:37, Casey Stella <[email protected]> wrote:
> >>>>
> >>>> Ok, that's another story.  Hmmmm, we don't generally pre-parse because
> >>>> we try not to assume any particular format there (i.e. it could be
> >>>> strings, could be byte arrays).  Maybe the right answer is to pass the
> >>>> raw, non-normalized data (best-effort type of thing) through the parser
> >>>> and do the normalization post-parse... or is there a problem with that?
> >>>>
> >>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <[email protected]>
> >>> wrote:
> >>>>
> >>>>> Hi Casey,
> >>>>>
> >>>>> It is actually a pre-parse process, not a post-parse one. These types
> >>>>> of noise affect the position of an attribute, for example, and give us
> >>>>> a parsing exception. The timestamp example was not a good one because
> >>>>> that is actually a post-parse exception.
> >>>>>
> >>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <[email protected]>
> >>> wrote:
> >>>>>
> >>>>>> So, further transformation post-parse was one of the motivating
> >>>>>> reasons for Stellar (to do that transformation post-parse).  Is there
> >>>>>> a capability that it's lacking that we can add to fit your use case?
> >>>>>>
> >>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I've created a Jira ticket regarding this feature.
> >>>>>>>
> >>>>>>> https://issues.apache.org/jira/browse/METRON-893
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Currently, we are using plain regex in the Java source code to
> >>>>>>>> handle those situations. However, it would be nice to have a
> >>>>>>>> separate bolt and deal with them separately. Yeah, I can create a
> >>>>>>>> Jira issue regarding that. The main reason I am asking for such a
> >>>>>>>> feature is that the lack of it makes the process of creating parsers
> >>>>>>>> for the community a little painful for us. We need to maintain two
> >>>>>>>> different versions, one for the community and another for our
> >>>>>>>> internal use case. Clearly, noise is an inevitable part of
> >>>>>>>> real-world use cases.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Ali
> >>>>>>>>
> >>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Are you doing this cleansing all in the parser or are you using
> >> any
> >>>>>>>>> Stellar to do it?
> >>>>>>>>> Can you create a jira?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian ([email protected])
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> We are facing certain use cases in Metron production that happen
> >>>>>>>>> to be related to noisy streams, for example a wrong timestamp, a
> >>>>>>>>> duplicate hostname/IP address, etc. To deal with the normalization
> >>>>>>>>> we have added an additional step to the corresponding parsers to do
> >>>>>>>>> the data cleaning. Clearly, parsing is a standard factor which is
> >>>>>>>>> mostly related to the device that is generating the data and can be
> >>>>>>>>> reused for the same type of device everywhere, but normalization is
> >>>>>>>>> very production dependent and there is no point in mixing
> >>>>>>>>> normalization with parsing. It would be nice to have a separate
> >>>>>>>>> bolt in the parsing topologies dedicated to the production-related
> >>>>>>>>> cleaning process. In that case, everybody can easily contribute
> >>>>>>>>> additional parsers to the Metron community without being worried
> >>>>>>>>> about mixing parsers with the data cleaning process.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>>
> >>>>>>>>> Ali
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> A.Nazemian
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> A.Nazemian
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> A.Nazemian
> >>>>>
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > A.Nazemian
>
>


-- 
A.Nazemian
