Hi Nick,

The date could be corrupted for any number of reasons, and sometimes we have no control over the device. Obviously, it is not a big deal if we lose a <166> severity message, but it could be a different situation for a <161> severity message or an actual critical threat. However, I mentioned those defects only as examples to point out the importance of having a normalisation step in the Metron processing chain.

I still think there is no guarantee of getting an entirely clean and well-defined message in real-world use cases. If we recognise this situation as a problem, then finding a high-performance and flexible solution is not very hard.

Cheers,
Ali

On Tue, May 2, 2017 at 11:24 PM, Nick Allen <[email protected]> wrote:

> Before worrying about how to ingest this 'noisy' data, I would want to
> better understand root cause. If you cannot even get a valid date format,
> are you sure the data can be trusted?
>
> Rather than bending over backwards to try to ingest it, I would first make
> sure the telemetry is not totally bogus to begin with. Maybe it is better
> that the data is dropped in cases like this.
>
> IMHO, that is how I would tackle a problem like this. Not all data can be
> trusted.
>
> On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <[email protected]> wrote:
>
> > Are you sure? The syslog_host name is way more complicated than something
> > that can be a coincidence. I need to double check with one of the security
> > device experts, but I thought it was some kind of noise.
> >
> > Yes, we do have more use cases that seem to be corrupted. For example,
> > having duplicate IP addresses or a corrupted date format. Please have a
> > look at the following message. At least I am sure the date format is
> > corrupted in this one.
> >
> > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection
> > 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2*
> > *y.y.y.y/p2*
> >
> > Cheers,
> > Ali
> >
> > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > [email protected]> wrote:
> >
> > > In that instance, you're looking at valid syslog which should be parsed
> > > as such. The repeat host is not really a host in syslog terms, it's an
> > > application name header which happens to be the same.
> > > This is definitely a parser bug which should be handled, esp since the
> > > header is perfectly RFC compliant.
> > >
> > > Do you have any other such cases? My view is that parsers should be
> > > written with more any case, so should extract all the fields they can
> > > from malformed logs, rather than throwing exceptions, but that's more
> > > about the way we write parsers than having some kind of pre-clean.
> > >
> > > Simon
> > >
> > > Sent from my iPad
> > >
> > > > On 27 Apr 2017, at 08:04, Ali Nazemian <[email protected]> wrote:
> > > >
> > > > I do agree there is a fair amount of overhead for using another bolt
> > > > for this purpose. I am not pointing to the way of implementation. It
> > > > might be a way of implementation to segregate two extension points
> > > > without adding overhead; I haven't thought about it yet. However, the
> > > > main issue is sometimes the type of noise is something that generates
> > > > an exception on the parsing side. For example, have a look at the
> > > > following log:
> > > >
> > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> > > > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > > > (ryanmar)
> > > >
> > > > Clearly duplicate syslog_host throws an exception on parsing, so how
> > > > are we going to deal with that at post-parse transformation? It cannot
> > > > pass the parsing. This is only a single example of cases that might
> > > > affect the production data. Unless Stellar transformation is something
> > > > that can be done at pre-parse and for the entire message.
> > > >
> > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > [email protected]> wrote:
> > > >
> > > >> Ali,
> > > >>
> > > >> Sounds very much like what you’re talking about when you say
> > > >> normalization, and what I would understand it as, is the process
> > > >> fulfilled by stellar field transformation in the parser config. Agreed
> > > >> that some of these will be general, based on common metron standard
> > > >> schema, but others will be organisation specific (custom fields
> > > >> overloaded with different meanings for instance in CEF, for example).
> > > >> These are very much one of the reasons we have the stellar
> > > >> transformation step. I don’t think that should be moved to a separate
> > > >> bolt to be honest, because that comes with a fair amount of overhead,
> > > >> but logically it is in the parser config rather than the parser, so
> > > >> seems to serve this purpose in the post-parse transform, no?
> > > >>
> > > >> Simon
> > > >>
> > > >>> On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]> wrote:
> > > >>>
> > > >>> Hi Simon,
> > > >>>
> > > >>> The reason I am asking for a specific normalisation step is due to
> > > >>> the fact that normalisation is not a general use case which can be
> > > >>> used by other users. It is completely bound to our application. The
> > > >>> way we have fixed it, for now, is to add a normalisation step to the
> > > >>> parser and clean the incoming data so the parser step can work on
> > > >>> that, but I don't like it. There is no point in creating a parser
> > > >>> that can handle all of the possible noises that can exist in the
> > > >>> production data.
> > > >>> Even if it is possible to predict every kind of noise in production
> > > >>> data, there is no point for the Metron community to focus on building
> > > >>> a general purpose parser for a specific device while they can spend
> > > >>> that time on developing a cool feature. Even if it is possible to
> > > >>> predict noises and it is acceptable for the community to spend their
> > > >>> time on creating that kind of parser, why would every Metron user
> > > >>> need that extra normalisation? A user's data might be clean at the
> > > >>> first step, and obviously it only decreases the total throughput
> > > >>> without any use for that specific user.
> > > >>>
> > > >>> Imagine there is an additional bolt for normalisation and there is a
> > > >>> mechanism to customise the normalisation without changing the general
> > > >>> parser for a specific device. We can have a general parser as a
> > > >>> common parser for that device and leave the normalisation development
> > > >>> to users. However, it is very important to provide the normalisation
> > > >>> step as fast as possible.
> > > >>>
> > > >>> Cheers,
> > > >>> Ali
> > > >>>
> > > >>> On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>>> Yeah, we definitely don't want to rewrite parsing in Stellar. I
> > > >>>> would expect the job of the parser, however, to handle structural
> > > >>>> issues. In my mind, parsing is about transforming structures into
> > > >>>> fields and the role of the field transformations is to transform
> > > >>>> values. There's obvious overlap there wherein parsers may do some
> > > >>>> normalizations/transformations (i.e.
> > > >>>> look how grok handles timestamps), but it almost always gets us into
> > > >>>> trouble when parsers do even moderately complex value
> > > >>>> transformations.
> > > >>>>
> > > >>>> As I type this, though, I think I see your point. What you really
> > > >>>> want is to chain parsers, have a pre-parser to bring you 80% of the
> > > >>>> way there and hammer out all the structural issues so you might be
> > > >>>> able to use a more generic parser down the chain. I have often
> > > >>>> thought that maybe we should expose parsers as Stellar functions
> > > >>>> which take raw data and emit whole messages. This would allow us to
> > > >>>> compose parsers, so imagine the above example where you've written a
> > > >>>> stellar function to normalize the input and you're then passing it
> > > >>>> to a CSV parser, you could run "CSV_PARSE(ALI_NORMALIZE(message))"
> > > >>>> where you'd otherwise specify a parser.
> > > >>>>
> > > >>>> As for speed, the stellar expression would get compiled into a java
> > > >>>> object, so it shouldn't be appreciable overhead since we no longer
> > > >>>> lex and parse for every message.
> > > >>>>
> > > >>>> Is this kinda how you were seeing it?
> > > >>>>
> > > >>>> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <
> > > >>>> [email protected]> wrote:
> > > >>>>
> > > >>>>> The challenge there I suspect is going to be that you essentially
> > > >>>>> end up with the actual parser doing very little of value, and then
> > > >>>>> effectively trying to write a parser in stellar against a few broad
> > > >>>>> strings, which would likely give you all sorts of performance
> > > >>>>> problems.
> > > >>>>>
> > > >>>>> One solution is to write a very defensive and flexible parser, but
> > > >>>>> that would tend to be time consuming.
> > > >>>>>
> > > >>>>> There is also something to be said for doing some basic
> > > >>>>> transformation before the parser topic kafka in something like
> > > >>>>> nifi, but again, performance can be an issue there.
> > > >>>>>
> > > >>>>> If the noise is about broken structure for example, maybe a simple
> > > >>>>> pre-process step as part of your parser would make sense, e.g.
> > > >>>>> stripping syslog headers, or character set conversion, removing
> > > >>>>> very broken bits as part of the parse method.
> > > >>>>>
> > > >>>>> In terms of normalisation post-parse, I agree, that's 100% a job
> > > >>>>> for Stellar, and the fieldTransformations capability. Something I
> > > >>>>> would like to see would be a means to use that transformation step
> > > >>>>> to map to a well known (though loosely enforced) schema provided by
> > > >>>>> a governance framework, but that is a much bigger topic of
> > > >>>>> conversation.
> > > >>>>>
> > > >>>>> Note of course that not everything has to be parsed just because
> > > >>>>> it’s in the message. A relatively loose fitting parser which pulls
> > > >>>>> out the relevant data for the use case would be fine, and likely a
> > > >>>>> lot more tolerant of noise than something that felt the need for
> > > >>>>> every field. We do after all store the original_string for you if
> > > >>>>> you really absolutely have to have everything, so a more
> > > >>>>> schema-on-read philosophy certainly applies and will likely
> > > >>>>> side-step a lot of your issues.
> > > >>>>>
> > > >>>>> Simon
> > > >>>>>
> > > >>>>>> On 26 Apr 2017, at 14:37, Casey Stella <[email protected]> wrote:
> > > >>>>>>
> > > >>>>>> Ok, that's another story. hmmmm, we don't generally pre-parse
> > > >>>>>> because we try to not assume any particular format there (i.e.
> > > >>>>>> it could be strings, could be byte arrays). Maybe the right answer
> > > >>>>>> is to pass the raw, non-normalized data (best effort type of
> > > >>>>>> thing) through the parser and do the normalization post-parse..or
> > > >>>>>> is there a problem with that?
> > > >>>>>>
> > > >>>>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <[email protected]>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi Casey,
> > > >>>>>>>
> > > >>>>>>> It is actually a pre-parse process, not a post-parse one. These
> > > >>>>>>> types of noise affect the position of an attribute, for example,
> > > >>>>>>> and give us a parsing exception. The timestamp example was not a
> > > >>>>>>> good one because that is actually a post-parse exception.
> > > >>>>>>>
> > > >>>>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <[email protected]>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> So, further transformation post-parse was one of the motivating
> > > >>>>>>>> reasons for Stellar (to do that transformation post-parse). Is
> > > >>>>>>>> there a capability that it's lacking that we can add to fit your
> > > >>>>>>>> usecase?
> > > >>>>>>>>
> > > >>>>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <[email protected]>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> I've created a Jira ticket regarding this feature.
> > > >>>>>>>>>
> > > >>>>>>>>> https://issues.apache.org/jira/browse/METRON-893
> > > >>>>>>>>>
> > > >>>>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <[email protected]>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Currently, we are using normal regex at the Java source code
> > > >>>>>>>>>> to handle those situations. However, it would be nice to have
> > > >>>>>>>>>> a separate bolt and deal with them separately.
> > > >>>>>>>>>> Yeah, I can create a Jira issue regarding that. The main
> > > >>>>>>>>>> reason I am asking for such a feature is the fact that lack of
> > > >>>>>>>>>> such a feature makes the process of creating some parser for
> > > >>>>>>>>>> the community a little painful for us. We need to maintain two
> > > >>>>>>>>>> different versions, one for community another for the internal
> > > >>>>>>>>>> use case. Clearly, noise is an inevitable part of real world
> > > >>>>>>>>>> use cases.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Cheers,
> > > >>>>>>>>>> Ali
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <[email protected]>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Are you doing this cleansing all in the parser or are you
> > > >>>>>>>>>>> using any Stellar to do it?
> > > >>>>>>>>>>> Can you create a jira?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian ([email protected])
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> We are facing certain use cases in Metron production that
> > > >>>>>>>>>>> happen to be related to noisy stream. For example, a wrong
> > > >>>>>>>>>>> timestamp, duplicate hostname/IP address, etc. To deal with
> > > >>>>>>>>>>> the normalization we have added an additional step for the
> > > >>>>>>>>>>> corresponding parsers to do the data cleaning.
> > > >>>>>>>>>>> Clearly, parsing is a standard factor which is mostly related
> > > >>>>>>>>>>> to the device that is generating the data and can be used for
> > > >>>>>>>>>>> the same type of device everywhere, but normalization is very
> > > >>>>>>>>>>> production dependent and there is no point in mixing
> > > >>>>>>>>>>> normalization with parsing. It would be nice to have a
> > > >>>>>>>>>>> separate bolt in the parsing topologies dedicated to the
> > > >>>>>>>>>>> production-related cleaning process. In that case, everybody
> > > >>>>>>>>>>> can easily contribute to the Metron community with additional
> > > >>>>>>>>>>> parsers without being worried about mixing parsers and the
> > > >>>>>>>>>>> data cleaning process.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Regards,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Ali
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> A.Nazemian
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>> A.Nazemian
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> A.Nazemian
> > > >>>
> > > >>> --
> > > >>> A.Nazemian
> > > >
> > > > --
> > > > A.Nazemian
> >
> > --
> > A.Nazemian

--
A.Nazemian
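
For illustration only, here is a minimal sketch of the kind of pre-parse normalisation step debated in this thread, using the "hostname hostname" Teardown ICMP message quoted above as input. It is not Metron code and not the implementation anyone in the thread used: Metron parsers are written in Java, and Python is used here purely for brevity. The names collapse_duplicate_host and pre_parse, and the regular expression itself, are assumptions made up for this example. Note also that Simon's reply argues the repeated token is valid syslog (an application name header), so whether it should be collapsed at all is a site-specific decision.

```python
import re

# Hypothetical pattern (an assumption for this sketch): a syslog PRI such as
# <166>, a "Mmm dd HH:MM:SS" timestamp, then the same host token twice.
_DUP_HOST = re.compile(
    r"^(?P<head><\d{1,3}>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+)"
    r"(?P<host>\S+)\s+(?P=host)(?=\s)"
)


def collapse_duplicate_host(raw):
    """Drop the second copy of a repeated host token; leave other input as-is."""
    return _DUP_HOST.sub(lambda m: m.group("head") + m.group("host"), raw, count=1)


def pre_parse(raw, steps=(collapse_duplicate_host,)):
    """Apply each site-specific normalisation step, in order, before parsing."""
    for step in steps:
        raw = step(raw)
    return raw


if __name__ == "__main__":
    noisy = ("<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP "
             "connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0")
    print(pre_parse(noisy))
```

Keeping the steps as a plain list of functions mirrors the composition idea from the thread ("CSV_PARSE(ALI_NORMALIZE(message))"): the device parser stays generic, and each deployment supplies its own cleaning steps in front of it.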
