Before worrying about how to ingest this 'noisy' data, I would want to better understand the root cause. If you cannot even get a valid date format, are you sure the data can be trusted?
Rather than bending over backwards to try to ingest it, I would first make sure the telemetry is not totally bogus to begin with. Maybe it is better that the data is dropped in cases like this. IMHO, that is how I would tackle a problem like this. Not all data can be trusted.

On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian <[email protected]> wrote:

> Are you sure? The syslog_host name is way more complicated than something that can be a coincidence. I need to double-check with one of the security device experts, but I thought it was some kind of noise.
>
> Yes, we do have more use cases that seem to be corrupted, for example duplicate IP addresses or a corrupted date format. Please have a look at the following message; at least I am sure the date format is corrupted in this one:
>
> <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2* *y.y.y.y/p2*
>
> Cheers,
> Ali
>
> On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <[email protected]> wrote:
>
> > In that instance, you're looking at valid syslog which should be parsed as such. The repeated host is not really a host in syslog terms; it's an application name header which happens to be the same. This is definitely a parser bug which should be handled, especially since the header is perfectly RFC compliant.
> >
> > Do you have any other such cases? My view is that parsers should be written with more care in any case, and should extract all the fields they can from malformed logs rather than throwing exceptions, but that's more about the way we write parsers than having some kind of pre-clean.
> >
> > Simon
> >
> > Sent from my iPad
> >
> > On 27 Apr 2017, at 08:04, Ali Nazemian <[email protected]> wrote:
> >
> > > I do agree there is a fair amount of overhead in using another bolt for this purpose. I am not pointing to a particular implementation; there might be a way to segregate the two extension points without adding overhead, but I haven't thought about it yet. However, the main issue is that sometimes the type of noise is something that generates an exception on the parsing side. For example, have a look at the following log:
> > >
> > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0 (ryanmar)
> > >
> > > Clearly the duplicate syslog_host throws an exception during parsing, so how are we going to deal with that in a post-parse transformation? It cannot pass parsing. This is only a single example of the cases that might affect production data, unless Stellar transformation is something that can be done pre-parse and on the entire message.
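As an illustration of the kind of pre-parse cleanup being discussed, the sketch below collapses the duplicated hostname token with a regex before the line ever reaches the parser. The class and method names are hypothetical and not part of Metron; the pattern is tuned only to the sample ASA line above.

    import java.util.regex.Pattern;

    /** Hypothetical pre-parse cleaner for the duplicated-hostname case above. */
    public class SyslogPreCleaner {
      // Matches "<PRI>MMM dd HH:mm:ss host host " and keeps a single host token.
      private static final Pattern DUPLICATE_HOST = Pattern.compile(
          "^(<\\d+>\\S+ +\\d+ +\\d{2}:\\d{2}:\\d{2} +)(\\S+) +\\2 +");

      public static String collapseDuplicateHost(String rawLine) {
        return DUPLICATE_HOST.matcher(rawLine).replaceFirst("$1$2 ");
      }
    }

Feeding the ASA example through collapseDuplicateHost yields a line with a single hostname token, which a standard syslog parser can then handle.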
> > >
> > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <[email protected]> wrote:
> > >
> > > > Ali,
> > > >
> > > > It sounds very much like what you're talking about when you say normalization, as I would understand it, is the process fulfilled by Stellar field transformation in the parser config. Agreed that some of these will be general, based on the common Metron standard schema, but others will be organisation specific (custom fields overloaded with different meanings in CEF, for example). That is very much one of the reasons we have the Stellar transformation step. I don't think it should be moved to a separate bolt, to be honest, because that comes with a fair amount of overhead; logically it lives in the parser config rather than the parser, so it seems to serve this purpose in the post-parse transform, no?
> > > >
> > > > Simon
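For readers who have not used the mechanism Simon is referring to, the sensor parser configuration can attach Stellar field transformations that run after the parser has produced fields. A minimal sketch of the shape of that config follows; the parser class name, field, and TRIM expression are only illustrative, not a recommendation for the ASA sensor.

    {
      "parserClassName": "org.apache.metron.parsers.asa.BasicAsaParser",
      "sensorTopic": "asa",
      "fieldTransformations": [
        {
          "transformation": "STELLAR",
          "output": ["syslog_host"],
          "config": {
            "syslog_host": "TRIM(syslog_host)"
          }
        }
      ]
    }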
> > > >
> > > > On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]> wrote:
> > > >
> > > > > Hi Simon,
> > > > >
> > > > > The reason I am asking for a specific normalisation step is that this normalisation is not a general use case which can be reused by other users; it is completely bound to our application. The way we have fixed it for now is to add a normalisation step to the parser and clean the incoming data so the parser step can work on it, but I don't like it. There is no point in creating a parser that can handle all of the possible noise that can exist in production data. Even if it were possible to predict every kind of noise, there is no point in the Metron community focusing on building a general-purpose parser for a specific device when that time could be spent on developing a cool feature. And even if it were possible to predict the noise, and acceptable for the community to spend their time on that kind of parser, why would every Metron user need that extra normalisation? A user's data might already be clean at the first step, and then the extra step only decreases total throughput without any benefit for that specific user.
> > > > >
> > > > > Imagine there is an additional bolt for normalisation, and a mechanism to customise the normalisation without changing the general parser for a specific device. We could have a general parser as a common parser for that device and leave the normalisation development to users. However, it is very important to keep the normalisation step as fast as possible.
> > > > >
> > > > > Cheers,
> > > > > Ali
> > > > >
> > > > > On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <[email protected]> wrote:
> > > > >
> > > > > > Yeah, we definitely don't want to rewrite parsing in Stellar. I would expect the job of the parser, however, to handle structural issues. In my mind, parsing is about transforming structures into fields, and the role of the field transformations is to transform values. There's obvious overlap there, wherein parsers may do some normalizations/transformations (e.g. look at how Grok handles timestamps), but it almost always gets us into trouble when parsers do even moderately complex value transformations.
> > > > > >
> > > > > > As I type this, though, I think I see your point. What you really want is to chain parsers: have a pre-parser bring you 80% of the way there and hammer out all the structural issues, so that you can use a more generic parser further down the chain. I have often thought that maybe we should expose parsers as Stellar functions which take raw data and emit whole messages. This would allow us to compose parsers, so imagine the above example where you've written a Stellar function to normalize the input and you're then passing it to a CSV parser: you could run "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a parser.
> > > > > >
> > > > > > As for speed, the Stellar expression would get compiled into a Java object, so it shouldn't add appreciable overhead since we would no longer lex and parse for every message.
> > > > > >
> > > > > > Is this kinda how you were seeing it?
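To make the chaining idea concrete, here is a small Java sketch of composing a normalizer with a parser as plain functions. It is a conceptual illustration of the composition Casey describes, not Metron's Stellar or parser API; every name in it is hypothetical.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    /** Conceptual sketch of parser chaining; none of these types exist in Metron. */
    public class ParserChaining {
      public static void main(String[] args) {
        // Hypothetical pre-parser: fix structural noise, e.g. a duplicated host token.
        Function<String, String> normalize = raw -> raw.replaceAll("(\\S+) \\1 ", "$1 ");

        // Hypothetical generic parser: turn the cleaned line into a field map.
        Function<String, Map<String, Object>> parse = line -> {
          Map<String, Object> fields = new HashMap<>();
          fields.put("original_string", line);
          fields.put("token_count", line.split("\\s+").length);
          return fields;
        };

        // The composition plays the role of CSV_PARSE(ALI_NORMALIZE(message)) above.
        Function<String, Map<String, Object>> chained = normalize.andThen(parse);
        System.out.println(chained.apply(
            "<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP connection ..."));
      }
    }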
> > > > > >
> > > > > > On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <[email protected]> wrote:
> > > > > >
> > > > > > > The challenge there, I suspect, is going to be that you essentially end up with the actual parser doing very little of value, and then effectively trying to write a parser in Stellar against a few broad strings, which would likely give you all sorts of performance problems.
> > > > > > >
> > > > > > > One solution is to write a very defensive and flexible parser, but that would tend to be time consuming.
> > > > > > >
> > > > > > > There is also something to be said for doing some basic transformation before the parser's Kafka topic, in something like NiFi, but again, performance can be an issue there.
> > > > > > >
> > > > > > > If the noise is about broken structure, for example, maybe a simple pre-process step as part of your parser would make sense, e.g. stripping syslog headers, converting character sets, or removing very broken bits as part of the parse method.
> > > > > > >
> > > > > > > In terms of normalisation post-parse, I agree that is 100% a job for Stellar and the fieldTransformations capability. Something I would like to see would be a means to use that transformation step to map to a well-known (though loosely enforced) schema provided by a governance framework, but that is a much bigger topic of conversation.
> > > > > > >
> > > > > > > Note, of course, that not everything has to be parsed just because it's in the message. A relatively loose-fitting parser which pulls out the relevant data for the use case would be fine, and likely a lot more tolerant of noise than something that felt the need to extract every field. We do, after all, store the original_string for you if you really, absolutely have to have everything, so a more schema-on-read philosophy certainly applies and will likely side-step a lot of your issues.
> > > > > > >
> > > > > > > Simon
> > > > > > >
> > > > > > > On 26 Apr 2017, at 14:37, Casey Stella <[email protected]> wrote:
> > > > > > >
> > > > > > > > Ok, that's another story. Hmmm, we don't generally pre-parse because we try not to assume any particular format there (i.e. it could be strings, could be byte arrays). Maybe the right answer is to pass the raw, non-normalized data (best-effort type of thing) through the parser and do the normalization post-parse... or is there a problem with that?
> > > > > > > >
> > > > > > > > On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Casey,
> > > > > > > > >
> > > > > > > > > It is actually a pre-parse process, not a post-parse one. These types of noise affect, for example, the position of an attribute and give us a parsing exception. The timestamp example was not a good one, because that is actually a post-parse exception.
> > > > > > > > >
> > > > > > > > > On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > So, further transformation post-parse was one of the motivating reasons for Stellar (to do that transformation post-parse). Is there a capability it is lacking that we can add to fit your use case?
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > I've created a Jira ticket regarding this feature:
> > > > > > > > > > >
> > > > > > > > > > > https://issues.apache.org/jira/browse/METRON-893
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Currently, we are using plain regex in the Java source code to handle those situations. However, it would be nice to have a separate bolt and deal with them separately. Yeah, I can create a Jira issue regarding that. The main reason I am asking for such a feature is that the lack of it makes the process of creating a parser for the community a little painful for us: we need to maintain two different versions, one for the community and another for the internal use case. Clearly, noise is an inevitable part of real-world use cases.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Ali
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Are you doing this cleansing all in the parser, or are you using any Stellar to do it? Can you create a Jira?
> > > > > > > > > > > > >
> > > > > > > > > > > > > On April 26, 2017 at 08:59:16, Ali Nazemian ([email protected]) wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We are facing certain use cases in Metron production that happen to be related to noisy streams, for example a wrong timestamp, a duplicate hostname/IP address, etc. To deal with the normalization we have added an additional step to the corresponding parsers to do the data cleaning. Clearly, parsing is a standard factor which is mostly related to the device that is generating the data and can be reused for the same type of device everywhere, but normalization is very production dependent, and there is no point in mixing normalization with parsing.
> > > > > > > > > > > > > > It would be nice to have a separate bolt in the parsing topology dedicated to the production-related cleaning process. In that case, everybody could easily contribute additional parsers to the Metron community without worrying about mixing parsers with the data cleaning process.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ali
>
> --
> A.Nazemian
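For reference, here is a minimal sketch of what the proposed normalisation bolt could look like as a plain Storm bolt sitting between the Kafka spout and the parser bolt. Metron does not ship such a bolt; the class name, the assumption that the spout emits the raw bytes in a field called "value", and the cleaning rule are all illustrative.

    import java.nio.charset.StandardCharsets;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    /** Hypothetical pre-parse normalisation bolt; not part of Metron. */
    public class NormalizationBolt extends BaseBasicBolt {

      @Override
      public void execute(Tuple input, BasicOutputCollector collector) {
        // Assumes the upstream spout emits the raw message bytes in a field named "value".
        byte[] raw = input.getBinaryByField("value");
        String line = new String(raw, StandardCharsets.UTF_8);

        // Production-specific cleaning, e.g. collapsing a duplicated hostname token.
        String cleaned = line.replaceAll("(\\S+) \\1 ", "$1 ");

        // Hand the cleaned bytes to the downstream parser bolt, otherwise unchanged.
        collector.emit(new Values(cleaned.getBytes(StandardCharsets.UTF_8)));
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
      }
    }

The same cleaning could equally live in a NiFi processor ahead of the Kafka topic or in the parser's own pre-process step, as Simon suggests earlier in the thread; the bolt is simply the variant Ali proposes.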
