Thanks Desheng! Could you create an issue to track it?

史德昇 <[email protected]> wrote on Mon, May 12, 2025 at 14:25:

>
> Dear Apache SeaTunnel community members:
>
> Hello!
>       I am Shi Desheng, a big data engineer who has been following the
> development of the SeaTunnel project for a long time. While using and
> studying the project, I found that SeaTunnel cannot parse and store
> irregular data such as host logs, program run logs, and other irregularly
> formatted logs. This may lead potential users who need this capability to
> give up on SeaTunnel. I therefore did some secondary development and built
> RegexParseTransform, a log parsing plugin based on regular expressions
> that parses irregular logs into structured logs. I would like to
> contribute this feature to the open source community and work with the
> SeaTunnel community members to help SeaTunnel become the world's top open
> source data synchronization tool.
>
> *1. Background and Motivation*
>
> As data sources grow more complex and business requirements evolve
> rapidly, general-purpose data integration frameworks face many challenges
> in practice. In particular, SeaTunnel lacks a general parsing capability
> for raw logs with variable, irregular, or even deeply nested formats (such
> as Apache/Nginx access logs, Linux host logs, system syslogs, and custom
> program print logs). These data are an indispensable part of enterprise
> data governance and real-time monitoring. Adding the *RegexParseTransform
> plugin* to improve SeaTunnel's ability to parse irregular logs would
> therefore both expand its application scenarios and enhance its
> competitiveness in areas such as log analysis and observability platforms.
> ------------------------------
> *2. Goals*
>
>    - Provide a *Transform plugin named RegexParseTransform* for parsing
>      irregular logs into structured logs.
>    - Support for *parsing irregular logs*, such as:
>       - Apache/Nginx access logs
>       - Linux Host logs
>       - system syslogs
>       - custom program print logs
>    - Support *multiple key information extraction*.
>    - Support *retaining original logs*.
>    - Support one configuration to *parse an entire irregular log*.
>    - Compatible with both *BATCH* and *STREAMING* job modes.
>    - Transparent integration with all connector-v2 pipelines (no need to
>      modify source/sink plugins).
>
> ------------------------------
> *3. Design Overview*
>
> *3.1 Plugin Configuration*
>
> transform {
>   # Sample data:
>   # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
>   RegexParse {
>     regex_parse_field = "value"
>     # The dots in the IP address are escaped so they match only a literal "."
>     regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"""
>     groupMap = {
>         what_log_orig:0           # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
>         device_ip:1               # 192.168.73.1
>         operation_time:2          # Apr 13 16:27:33
>         device_name:3             # asap91
>         operation_command:4       # su
>         how_op_res:5              # FAILED
>         slave_account_name:6      # asap
>     }
>   }
> }
>
>
>    - regex_parse_field: The upstream field to parse.
>    - regex: The regular expression to apply to that field.
>    - groupMap: The mapping from result field names to capture group
>      indexes of the regular expression (index 0 is the entire match).
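
To illustrate how groupMap relates to capture groups, here is a minimal Python sketch (not the plugin's Java code) that applies the sample regex to the sample line. It assumes the dots in the IP address are escaped and that the pattern may match anywhere in the line; the proposal does not say whether a full-line match is required. The name parse_line is hypothetical.

```python
import re

# Pattern from the sample configuration, with the IP dots escaped.
REGEX = (
    r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>"
    r"(\w+\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+):\s+(\w+)\s+(\w+)"
)

# Result field name -> capture group index; group 0 is the entire match.
GROUP_MAP = {
    "what_log_orig": 0,
    "device_ip": 1,
    "operation_time": 2,
    "device_name": 3,
    "operation_command": 4,
    "how_op_res": 5,
    "slave_account_name": 6,
}


def parse_line(line):
    """Return a dict of result fields, or None when the line does not match."""
    match = re.search(REGEX, line)
    if match is None:
        return None
    return {field: match.group(index) for field, index in GROUP_MAP.items()}


sample = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"
parsed = parse_line(sample)
print(parsed)
```

For the sample line, parsed maps device_ip to "192.168.73.1", operation_time to "Apr 13 16:27:33", and what_log_orig to the whole line.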
>
> *3.2 Execution Logic*
>
>    - When starting a job, RegexParseTransform will:
>       1. Read the three core parameters from the configuration and resolve
>          the index of the regex_parse_field field in the upstream row.
>    - On receiving a record, RegexParseTransform will:
>       1. Retrieve the value of regex_parse_field, run the regular
>          expression match, and validate the result.
>       2. Traverse the groupMap entries and, for each capture group index,
>          extract the corresponding group content from the regex matcher.
>       3. Assemble the extracted values with the original values and
>          pass them downstream.
>    - When converting data structures, RegexParseTransform will:
>       1. Traverse the groupMap entries and, by default, set the type of
>          every result field to String (or convert them to their
>          respective types based on their data format).
>    - The transform is implemented by extending SeaTunnel's
>      AbstractCatalogSupportTransform and will be fully parallel.
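
The execution steps above can be sketched roughly as follows. This is a Python illustration of the flow only, not SeaTunnel's Java API: the class RegexParseSketch and its methods are hypothetical, rows are modeled as plain lists, and filling the new fields with None when a record does not match is an assumption, since the proposal does not specify that case.

```python
import re


class RegexParseSketch:
    """Hypothetical stand-in for the proposed transform; not SeaTunnel API."""

    def __init__(self, input_fields, regex_parse_field, regex, group_map):
        # Job start: compile the pattern once and resolve the input column.
        self.pattern = re.compile(regex)
        self.field_index = input_fields.index(regex_parse_field)
        self.group_map = dict(group_map)
        # Validate that every configured index refers to an existing group.
        for name, index in self.group_map.items():
            if not 0 <= index <= self.pattern.groups:
                raise ValueError(f"{name}: no capture group {index}")
        # Output schema: original columns, then the result fields
        # (all treated as strings by default).
        self.output_fields = list(input_fields) + list(self.group_map)

    def transform(self, row):
        # Per record: fetch the value, match, extract groups, append.
        value = row[self.field_index]
        match = self.pattern.search(value) if isinstance(value, str) else None
        if match is None:
            # Assumption: keep the record and leave the new fields empty.
            extracted = [None] * len(self.group_map)
        else:
            extracted = [match.group(i) for i in self.group_map.values()]
        return list(row) + extracted


sketch = RegexParseSketch(
    ["id", "value"], "value", r"(\w+)=(\d+)", {"metric": 1, "reading": 2}
)
matched_row = sketch.transform([1, "cpu=93"])    # [1, "cpu=93", "cpu", "93"]
unmatched_row = sketch.transform([2, "???"])     # [2, "???", None, None]
```

Because each record is processed independently and the compiled pattern is read-only, nothing in this flow prevents running many instances in parallel.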
>
> ------------------------------
> *4. Implementation Plan*
> Task     Description
> Phase 1  Support extracting multiple key information through regular
>          expressions.
> Phase 2  Support converting them into their respective types based on
>          data format.
> Phase 3  Add test coverage (unit + e2e) and documentation on the website.
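
As one possible reading of Phase 2, the following Python sketch converts an extracted string to a number when it looks numeric and otherwise leaves it as a string. The proposal does not define the conversion rules, so the function convert_value and its int-then-float order are purely assumptions for illustration.

```python
def convert_value(text):
    """Return text as int or float when it parses as one, else unchanged."""
    if text is None:
        return None
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            continue
    return text


converted = [convert_value(v) for v in ["37", "1.5", "FAILED", "192.168.73.1"]]
```

Here "37" becomes an integer and "1.5" a float, while "FAILED" and "192.168.73.1" stay strings because neither parses as a number.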
> *5. End*
>
>        I sincerely appreciate the opportunity to contribute to the Apache
> SeaTunnel open source project. My proposal is attached. Thank you for
> taking the time to review it amidst your busy schedule. I look forward
> to your reply.
>
>       Best regards,
>
>       Shi Desheng
>
