I do agree that using another bolt for this purpose adds a fair amount of overhead. I am not arguing for a particular implementation. There might be a way to separate the two extension points without adding overhead; I haven't thought about it yet. However, the main issue is that sometimes the noise is the kind that causes an exception on the parsing side. For example, have a look at the following log:
<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0 (ryanmar)

Clearly, the duplicate syslog_host throws an exception on parsing, so how are we going to deal with that in a post-parse transformation? The message cannot get past parsing. This is only a single example of the cases that might affect production data. A post-parse step cannot help here, unless Stellar transformation is something that can be done pre-parse and on the entire message.

On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <[email protected]> wrote:

> Ali,
>
> Sounds very much like what you’re talking about when you say normalization, and what I would understand it as, is the process fulfilled by stellar field transformation in the parser config. Agreed that some of these will be general, based on common metron standard schema, but others will be organisation specific (custom fields overloaded with different meanings for instance in CEF, for example). These are very much one of the reasons we have the stellar transformation step. I don’t think that should be moved to a separate bolt to be honest, because that comes with a fair amount of overhead, but logically it is in the parser config rather than the parser, so seems to serve this purpose in the post-parse transform, no?
>
> Simon
>
> > On 27 Apr 2017, at 02:08, Ali Nazemian <[email protected]> wrote:
> >
> > Hi Simon,
> >
> > The reason I am asking for a specific normalisation step is due to the fact that normalisation is not a general use case which can be used by other users. It is completely bounded to our application. The way we have fixed it, for now, is to add a normalisation step to the parser and clear the incoming data so the parser step can work on that, but I don't like it. There is no point of creating a parser that can handle all of the possible noises that can exist in the production data.
> > Even if it is possible to predict every kind of noise in production data there is no point for Metron community to focus on building a general purpose parser for a specific device while they can spend that time on developing a cool feature. Even if it is possible to predict noises and it is acceptable for the community to spend their time on creating that kind of parser why every Metron user need that extra normalisation? A user data might be clear at the first step and obviously, it only decreases the total throughput without any use for that specific user.
> >
> > Imagine there is an additional bolt for normalisation and there is a mechanism to customise the normalisation without changing the general parser for a specific device. We can have a general parser as a common parser for that device and leave the normalisation development to users. However, it is very important to provide the normalisation step as fast as possible.
> >
> > Cheers,
> > Ali
> >
> > On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella <[email protected]> wrote:
> >
> >> Yeah, we definitely don't want to rewrite parsing in Stellar. I would expect the job of the parser, however, to handle structural issues. In my mind, parsing is about transforming structures into fields and the role of the field transformations are to transform values. There's obvious overlap there wherein parsers may do some normalizations/transformations (i.e. look how grok handles timestamps), but it almost always gets us into trouble when parsers do even moderately complex value transformations.
> >>
> >> As I type this, though, I think I see your point. What you really want is to chain parsers, have a pre-parser to bring you 80% of the way there and hammer out all the structural issues so you might be able to use a more generic parser down the chain.
> >> I have often thought that maybe we should expose parsers as Stellar functions which take raw data and emit whole messages. This would allow us to compose parsers, so imagine the above example where you've written a stellar function to normalize the input and you're then passing it to a CSV parser, you could run "CSV_PARSE(ALI_NORMALIZE(message))" where you'd otherwise specify a parser.
> >>
> >> As for speed, the stellar expression would get compiled into a java object, so it shouldn't be appreciable overhead since we no longer lex and parse for every message.
> >>
> >> Is this kinda how you were seeing it?
> >>
> >> On Wed, Apr 26, 2017 at 9:51 AM, Simon Elliston Ball <[email protected]> wrote:
> >>
> >>> The challenge there I suspect is going to be that you essentially end up with the actual parser doing very little of value, and then effectively trying to write a parser in stellar against a few broad strings, which would likely give you all sorts of performance problems.
> >>>
> >>> One solution is to write a very defensive and flexible parser, but that would tend to be time consuming.
> >>>
> >>> There is also something to be said for doing some basic transformation before the parser topic kafka in something like nifi, but again, performance can be an issue there.
> >>>
> >>> If the noise is about broken structure for example, maybe a simple pre-process step as part of your parser would make sense, e.g. stripping syslog headers, or character set conversion, removing very broken bits as part of the parse method.
> >>>
> >>> In terms of normalisation post-parse, I agree, that's 100% a job for Stellar, and the fieldTransformations capability.
> >>> Something I would like to see would be a means to use that transformation step to map to a well known (though loosely enforced) schema provided by a governance framework, but that is a much bigger topic of conversation.
> >>>
> >>> Note of course that not everything has to be parsed just because it’s in the message. A relatively loose fitting parser which pulls out the relevant data for the use case would be fine, and likely a lot more tolerant of noise than something that felt the need for every field. We do after all store the original_string for you if you really absolutely have to have everything, so a more schema-on-read philosophy certainly applies and will likely side-step a lot of your issues.
> >>>
> >>> Simon
> >>>
> >>>> On 26 Apr 2017, at 14:37, Casey Stella <[email protected]> wrote:
> >>>>
> >>>> Ok, that's another story. hmmmm, we don't generally pre-parse because we try to not assume any particular format there (i.e. it could be strings, could be byte arrays). Maybe the right answer is to pass the raw, non-normalized data (best effort type of thing) through the parser and do the normalization post-parse..or is there a problem with that?
> >>>>
> >>>> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian <[email protected]> wrote:
> >>>>
> >>>>> Hi Casey,
> >>>>>
> >>>>> It is actually a pre-parse process, not a post-parse one. These types of noise affect the position of an attribute, for example, and give us a parsing exception. The timestamp example was not a good one because that is actually a post-parse exception.
> >>>>>
> >>>>> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella <[email protected]> wrote:
> >>>>>
> >>>>>> So, further transformation post-parse was one of the motivating reasons for Stellar (to do that transformation post-parse).
> >>>>>> Is there a capability that it's lacking that we can add to fit your usecase?
> >>>>>>
> >>>>>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian <[email protected]> wrote:
> >>>>>>
> >>>>>>> I've created a Jira ticket regarding this feature.
> >>>>>>>
> >>>>>>> https://issues.apache.org/jira/browse/METRON-893
> >>>>>>>
> >>>>>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Currently, we are using normal regex at the Java source code to handle those situations. However, it would be nice to have a separate bolt and deal with them separately. Yeah, I can create a Jira issue regarding that. The main reason I am asking for such a feature is the fact that lack of such a feature makes the process of creating some parser for the community a little painful for us. We need to maintain two different versions, one for community another for the internal use case. Clearly, noise is an inevitable part of real world use cases.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Ali
> >>>>>>>>
> >>>>>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Are you doing this cleansing all in the parser or are you using any Stellar to do it?
> >>>>>>>>> Can you create a jira?
> >>>>>>>>>
> >>>>>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian ([email protected]) wrote:
> >>>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> We are facing certain use cases in Metron production that happen to be related to noisy stream. For example, a wrong timestamp, duplicate hostname/IP address, etc.
> >>>>>>>>> To deal with the normalization we have added an additional step for the corresponding parsers to do the data cleaning. Clearly, parsing is a standard factor which is mostly related to the device that is generating the data and can be used for the same type of device everywhere, but normalization is very production dependent and there is no point in mixing normalization with parsing. It would be nice to have a separate bolt in the parsing topologies dedicated to the production-related cleaning process. In that case, everybody can easily contribute to the Metron community with additional parsers without being worried about mixing parsers and data cleaning process.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>>
> >>>>>>>>> Ali
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> A.Nazemian
> >>>>>>>
> >>>>>>> --
> >>>>>>> A.Nazemian
> >>>>>
> >>>>> --
> >>>>> A.Nazemian
> >
> > --
> > A.Nazemian

--
A.Nazemian
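
For illustration only, here is a minimal Python sketch of the pre-parse idea discussed in this thread: a normalisation step that removes the duplicated hostname from the sample log above, chained in front of a strict parser in the spirit of the "CSV_PARSE(ALI_NORMALIZE(message))" suggestion. It is not Metron or Stellar code. The function names (normalize, parse_asa, compose), both regular expressions, and every field name other than syslog_host are assumptions made up for this example.

```python
import re

# Hypothetical pre-parse rule: "<PRI>Mon DD HH:MM:SS host host ..." where the
# hostname token is repeated. The back-reference \2 only matches an exact repeat.
_DUP_HOST = re.compile(r"^(<\d+>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+\2\s+")

# Hypothetical strict parser pattern that expects exactly one hostname token.
_ASA = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<syslog_host>\S+)\s+"
    r"%ASA-(?P<severity>\d)-(?P<message_id>\d+):\s+"
    r"(?P<message>.*)$"
)


def normalize(raw: str) -> str:
    """Collapse a repeated hostname in the syslog header; leave other lines as-is."""
    return _DUP_HOST.sub(r"\1 \2 ", raw, count=1)


def parse_asa(line: str) -> dict:
    """Strict parse: raises ValueError when the line does not fit the pattern."""
    match = _ASA.match(line)
    if match is None:
        raise ValueError("unparseable ASA line")
    return match.groupdict()


def compose(*steps):
    """Chain steps left to right, so compose(f, g)(x) is g(f(x))."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


# Site-specific cleaning runs first, then the generic parser.
pipeline = compose(normalize, parse_asa)
```

With this arrangement the strict parser rejects the raw sample line, while pipeline(raw) parses it, and the cleaning rule lives outside the parser so each can be maintained separately.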
