[ https://issues.apache.org/jira/browse/METRON-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661739#comment-16661739 ]
ASF GitHub Bot commented on METRON-1795:
----------------------------------------

Github user jagdeepsingh2 closed the pull request at:

    https://github.com/apache/metron/pull/1214

> General Purpose Regex Parser
> ----------------------------
>
>                 Key: METRON-1795
>                 URL: https://issues.apache.org/jira/browse/METRON-1795
>             Project: Metron
>          Issue Type: New Feature
>            Reporter: Jagdeep Singh
>            Priority: Minor
>
> We have implemented a general purpose regex parser for Metron that we are
> interested in contributing back to the community.
>
> While the Metron Grok parser provides some regex-based capability today, the
> intention of this general purpose regex parser is to:
> # Allow for more advanced parsing scenarios (specifically, dealing with
> multiple regex lines for devices that contain several log formats within them)
> # Give users and developers of Metron additional options for parsing
> # With the new parser chaining and regex routing feature available in
> Metron, this gives some additional flexibility to logically separate a flow
> by:
> # Regex routing to segregate logs at a device level and handle envelope
> unwrapping
> # This general purpose regex parser to parse an entire device type that
> contains multiple log formats within the single device (for example, RHEL
> logs)
>
> At a high level, the control flow is:
> # Identify the record type of the incoming raw message.
> # Find and apply the regular expression of the corresponding record type to
> extract the fields (using named groups).
> # Apply the message header regex to extract the fields in the header part of
> the message (using named groups).
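
The three control-flow steps can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the parser code from the pull request: the class `RegexFlowSketch`, the helpers `extract`, `parse`, and `demo`, and the sample log line are hypothetical. The three regexes are copied from the example config in this issue description. Treating the full match of the record type regex as the record type is also an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the described control flow; not Metron's implementation.
class RegexFlowSketch {
    // Finds named-group declarations such as (?<process>...) in a regex string.
    // Lookbehinds, (?<= and (?<!, are skipped because a group name starts with a letter.
    // Limitation: an escaped literal like \(\?<x> would be mistaken for a group.
    private static final Pattern GROUP_NAME =
        Pattern.compile("\\(\\?<([A-Za-z][A-Za-z0-9]*)>");

    // Regexes taken from the example config in the issue description.
    static final String RECORD_TYPE_REGEX =
        "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))";
    static final String HEADER_REGEX =
        "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?"
        + "(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?"
        + "(?<syslogHost>(?<=\\s).*?(?=\\s))";

    // Applies one regex and copies every named group that matched into 'out'.
    // Returns the full match, or null when the regex does not match.
    static String extract(String regex, String message, Map<String, String> out) {
        Matcher m = Pattern.compile(regex).matcher(message);
        if (!m.find()) {
            return null;
        }
        Matcher names = GROUP_NAME.matcher(regex);
        while (names.find()) {
            String value = m.group(names.group(1));
            if (value != null) {
                out.put(names.group(1), value);
            }
        }
        return m.group();
    }

    static Map<String, String> parse(String recordTypeRegex, String headerRegex,
                                     Map<String, String> fieldRegexes, String message) {
        Map<String, String> result = new LinkedHashMap<>();
        // Step 1: identify the record type (assumption: the full match is the type).
        String recordType = extract(recordTypeRegex, message, result);
        if (recordType == null) {
            return result;
        }
        // Step 2: apply the regex registered for that record type, if any.
        String fieldRegex = fieldRegexes.get(recordType);
        if (fieldRegex != null) {
            extract(fieldRegex, message, result);
        }
        // Step 3: apply the header regex shared by all record types.
        extract(headerRegex, message, result);
        return result;
    }

    // Runs the flow on a made-up syslog line with the "kernel" regex from the config.
    static Map<String, String> demo() {
        Map<String, String> fieldRegexes = new HashMap<>();
        fieldRegexes.put("kernel", ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))");
        String message =
            "<13>Oct 24 10:15:30 host1 kernel: device eth0 entered promiscuous mode";
        return parse(RECORD_TYPE_REGEX, HEADER_REGEX, fieldRegexes, message);
    }

    public static void main(String[] args) {
        demo().forEach((k, v) -> System.out.println(k + " = [" + v + "]"));
    }
}
```

Because the extracted field names come straight from the named groups, adding a field in this sketch only requires adding a group to the relevant regex, which matches the config-driven approach the description proposes.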
>
> The parser config uses the following structure:
>
> {code:java}
> "recordTypeRegex": "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
> "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
> "fields": [
>   {
>     "recordType": "kernel",
>     "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
>   },
>   {
>     "recordType": "syslog",
>     "regex": ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
>   }
> ]
> {code}
>
> Where:
> * *recordTypeRegex* is used to distinctly identify a record type. It takes
> a valid regular expression and may also have named groups, which are
> extracted into fields.
> * *messageHeaderRegex* is used to specify a regular expression to extract
> fields from the part of a message that is common across all messages (e.g.,
> syslog fields, standard headers).
> * *fields*: JSON list of objects containing recordType and regex. The
> expression that is evaluated is based on the output of the recordTypeRegex.
> * Note: *recordTypeRegex* and *messageHeaderRegex* could also be specified as
> lists (as a JSON array), where the list will be evaluated in order until
> a matching regular expression is found.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)