Thanks Desheng! Could you create an issue to track it?
史德昇 <[email protected]> wrote on Monday, May 12, 2025 at 14:25:
>
> Dear Apache SeaTunnel community members:
>
> Hello!
> I am Shi Desheng, a big data engineer who has been following the
> development of the SeaTunnel project for a long time. While using and
> studying the project, I found that SeaTunnel cannot parse and store
> irregular log data, such as host logs, program run logs, and other
> irregular log formats. This may lead potential users who need this
> feature to give up on SeaTunnel, so I did some secondary development
> and built RegexParseTransform, a log-parsing plugin based on regular
> expressions that parses irregular logs into structured logs. I would
> like to contribute this feature to the open source community and work
> with SeaTunnel community members to help the project become the world's
> top open source data synchronization tool.
>
> *1. Background and Motivation*
>
> As data sources grow more complex and business requirements evolve
> rapidly, general-purpose data integration frameworks face many challenges
> in practice. In particular, SeaTunnel lacks a general parsing capability
> for raw logs with variable, irregular, or even deeply nested formats (such
> as Apache/Nginx access logs, Linux host logs, system syslogs, and custom
> program logs). Yet this data is an indispensable part of enterprise data
> governance and real-time monitoring. Adding the
> *RegexParseTransform plugin* to improve SeaTunnel's ability to parse
> irregular logs would therefore broaden its application scenarios and make
> it more competitive in areas such as log analysis and observability
> platforms.
> ------------------------------
> *2. Goals*
>
> - Provide a *Transform plugin named RegexParseTransform* for parsing
>   irregular logs into structured logs.
> - Support for *parsing irregular logs*, such as:
>   - Apache/Nginx access logs
>   - Linux Host logs
>   - system syslogs
>   - custom program print logs
> - Support *multiple key information extraction*.
> - Support *retaining original logs*.
> - Support one configuration to *parse an entire irregular log*.
> - Compatible with both *BATCH* and *STREAMING* job modes.
> - Transparent integration with all connector-v2 pipelines (no need to
>   modify source/sink plugins).
>
> ------------------------------
> *3. Design Overview*
>
> *3.1 Plugin Configuration*
>
> transform {
>   # Sample data:
>   # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
>   RegexParse {
>     regex_parse_field = "value"
>     # Dots in the IP address are escaped so that they match literally.
>     regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"""
>     groupMap = {
>       # Group 0 is the whole match:
>       # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
>       what_log_orig: 0
>       device_ip: 1            # 192.168.73.1
>       operation_time: 2       # Apr 13 16:27:33
>       device_name: 3          # asap91
>       operation_command: 4    # su
>       how_op_res: 5           # FAILED
>       slave_account_name: 6   # asap
>     }
>   }
> }
>
>
> - regex_parse_field: The upstream field whose value is parsed.
> - regex: The regular expression applied to that field.
> - groupMap: The mapping from result field names to regex capture group
>   indexes (index 0 is the entire match).
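
To make the three parameters concrete, here is a minimal sketch of the extraction applied to the sample line. It is written in Python only for brevity and is not the plugin's code; the function name `extract` and the choice to return None for a non-matching line are assumptions, and the pattern is the one from the configuration with the IP dots escaped.

```python
import re

# Sample line and pattern from the proposed configuration
# (assumption: dots in the IP address are escaped to match literally).
SAMPLE = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"
REGEX = (
    r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>"
    r"(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"
)

# groupMap: result field name -> capture group index (0 = entire match).
GROUP_MAP = {
    "what_log_orig": 0,
    "device_ip": 1,
    "operation_time": 2,
    "device_name": 3,
    "operation_command": 4,
    "how_op_res": 5,
    "slave_account_name": 6,
}


def extract(line, pattern, group_map):
    """Return {field: captured text}, or None if the line does not match.

    Hypothetical helper: the proposal does not say how unmatched lines
    are handled, so returning None is an assumption.
    """
    match = pattern.search(line)
    if match is None:
        return None
    return {field: match.group(index) for field, index in group_map.items()}


if __name__ == "__main__":
    for field, value in extract(SAMPLE, re.compile(REGEX), GROUP_MAP).items():
        print(f"{field} = {value}")
```

Each groupMap entry becomes one output field, so a single regex plus one mapping is enough to structure the whole line.
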
>
> *3.2 Execution Logic*
>
> - When starting a job, RegexParseTransform will:
>   1. Read the three core parameters from the configuration and resolve
>      the index of the regex_parse_field field in the upstream row.
> - On receiving a record, RegexParseTransform will:
>   1. Retrieve the value of the regex_parse_field field, then perform
>      regular expression matching and validation.
>   2. Traverse the entries of groupMap and, for each capture group
>      index, extract the corresponding captured group from the regex
>      matcher.
>   3. Assemble the extracted values together with the original values and
>      pass them downstream.
> - When converting data structures, RegexParseTransform will:
>   1. Traverse the entries of groupMap and, by default, set the data type
>      of every result field to String (or convert them to their
>      respective types based on their data format).
> - This transformation is implemented by extending SeaTunnel's
>   AbstractCatalogSupportTransform API and will be fully parallel.
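
The steps above can be sketched as follows. This is a rough Python illustration of the described flow, not the SeaTunnel API: the class `RegexParseSketch`, its method names, and the choice to emit None for rows that do not match are all assumptions, and every extracted field is left as a string (the Phase 1 behavior).

```python
import re


class RegexParseSketch:
    """Hypothetical stand-in for the transform; not the SeaTunnel API."""

    def __init__(self, input_fields, regex_parse_field, regex, group_map):
        # Job start: read the three parameters and resolve the field index.
        self.field_index = input_fields.index(regex_parse_field)
        self.pattern = re.compile(regex)
        self.group_map = dict(group_map)
        # Output schema: original fields followed by the result fields
        # (all result fields are treated as strings in this sketch).
        self.output_fields = list(input_fields) + list(self.group_map)

    def transform(self, row):
        # Step 1: fetch the value to parse and run the regex.
        value = row[self.field_index]
        match = self.pattern.search(value) if isinstance(value, str) else None
        # Step 2: pull one captured group per groupMap entry.
        if match is None:
            # Assumption: unmatched rows keep their original values and
            # get None for every result field.
            extracted = [None] * len(self.group_map)
        else:
            extracted = [match.group(i) for i in self.group_map.values()]
        # Step 3: original values plus extracted values go downstream.
        return list(row) + extracted


if __name__ == "__main__":
    t = RegexParseSketch(["id", "value"], "value",
                         r"(\w+)=(\d+)", {"key": 1, "num": 2})
    print(t.output_fields)
    print(t.transform([1, "cpu=42"]))
```

Because each record is processed independently and the compiled pattern is the only shared state, nothing in this flow prevents parallel execution.
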
>
> ------------------------------
> *4. Implementation Plan*
>
> Task     Description
> Phase 1  Support extracting multiple key fields through regular
>          expressions.
> Phase 2  Support converting them into their respective types based on
>          data format.
> Phase 3  Add test coverage (unit + e2e) and documentation on the website.
>
> *5. End*
>
> I sincerely appreciate the opportunity to contribute to the Apache
> SeaTunnel open source project. The attachment is my proposal. Thank you for
> taking the time to review it amidst your busy schedule. Looking forward
> to your reply.
>
> Best regards,
>
> Shi Desheng
>