Github user jagdeepsingh2 commented on a diff in the pull request:

    https://github.com/apache/metron/pull/1245#discussion_r237715210
  
    --- Diff: metron-platform/metron-parsers/README.md ---
    @@ -52,6 +52,62 @@ There are two general types types of parsers:
            This is using the default value for `wrapEntityName` if that 
property is not set.
         * `wrapEntityName` : Sets the name to use when wrapping JSON using 
`wrapInEntityArray`.  The `jsonpQuery` should reference this name.
         * A field called `timestamp` is expected to exist and, if it does not, 
then current time is inserted.  
    +  * Regular Expressions Parser
    +      * `recordTypeRegex` : A regular expression to uniquely identify a 
record type.
    +      * `messageHeaderRegex` : A regular expression used to extract fields 
from a message part which is common across all the messages.
    +      * `convertCamelCaseToUnderScore` : If this property is set to true, 
this parser will automatically convert all the camel case property names to 
underscore seperated. 
    +          For example, following convertions will automatically happen:
    +
    +          ```
    +          ipSrcAddr -> ip_src_addr
    +          ipDstAddr -> ip_dst_addr
    +          ipSrcPort -> ip_src_port
    +          ```
    +          Note this property may be necessary, because java does not 
support underscores in the named group names. So in case your property naming 
conventions requires underscores in property names, use this property.
    +          
    +      * `fields` : A json list of maps contaning a record type to regular 
expression mapping.
    +      
    +      A complete configuration example would look like:
    +      
    +      ```json
    +      "convertCamelCaseToUnderScore": true, 
    +      "recordTypeRegex": "kernel|syslog",
    +      "messageHeaderRegex": 
"(<syslogPriority>(<=^&lt;)\\d{1,4}(?=>)).*?(<timestamp>(<=>)[A-Za-z] 
{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(<syslogHost>(<=\\s).*?(?=\\s))",
    +      "fields": [
    +        {
    +          "recordType": "kernel",
    +          "regex": ".*(<eventInfo>(<=\\]|\\w\\:).*?(?=$))"
    +        },
    +        {
    +          "recordType": "syslog",
    +          "regex": 
".*(<processid>(<=PID\\s=\\s).*?(?=\\sLine)).*(<filePath>(<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))
        (<fileName>.*?(?=\")).*(<eventInfo>(<=\").*?(?=$))"
    +        }
    +      ]
    +      ```
    +      **Note**: messageHeaderRegex and regex (withing fields) could be 
specified as lists also e.g.
    +      ```json
    +          "messageHeaderRegex": [
    +          "regular expression 1",
    +          "regular expression 2"
    +          ]
    +      ```
    +      Where **regular expression 1** are valid regular expressions and may 
have named
    +      groups, which would be extracted into fields. This list will be 
evaluated in order until a
    +      matching regular expression is found.
    +      
    +      **recordTypeRegex** can be a more advanced regular expression 
containing named goups. For example
    --- End diff --
    
    Though having a named group in recordType is completely optional, you may 
still want to use a named group in recordType for the following reasons:
    
    1. Since the **recordType** regular expression is already being matched, 
and we are already paying the price for that match, we can extract certain 
fields as a by-product of it.
    2. Most likely the recordType field is common across all the messages. 
Extracting it in **recordType** (or **messageHeaderRegex**) therefore reduces 
the overall complexity of the regular expressions in the **regex** field.
    
    Again, how you craft your parser configuration is a personal choice. 
These are just the options available to the user.
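
To illustrate the first point, here is a minimal sketch in plain `java.util.regex`, not the parser's actual code. The pattern, the group name `processName`, the output keys `record_type` and `process_name`, and the sample message are all hypothetical and chosen only for illustration. It shows a single match both identifying the record type and yielding a field through a named group:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordTypeDemo {
    // Hypothetical recordTypeRegex with a named group (camel case, because
    // Java group names may contain only letters and digits).
    static final Pattern RECORD_TYPE =
        Pattern.compile("(?<processName>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))");

    // Runs the record type match once and collects what it yields.
    static Map<String, String> extract(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = RECORD_TYPE.matcher(message);
        if (m.find()) {
            // The whole match selects the record type...
            fields.put("record_type", m.group());
            // ...and the named group gives a field from the same match.
            fields.put("process_name", m.group("processName"));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints {record_type=kernel, process_name=kernel}
        System.out.println(extract("<30>Nov  5 10:15:01 host1 kernel: eth0 link up"));
    }
}
```

With a plain `kernel|syslog` expression the same match would still pick the record type, but the process name would have to be captured again by each expression under `fields`.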


---
