Good suggestion.


Best Regards

---------------
David
Linkedin: https://www.linkedin.com/in/davidzollo
---------------


On Mon, May 12, 2025 at 8:59 PM 史德昇 <[email protected]> wrote:

> No problem, I will create this issue and continue to track and complete it.
>
>
> ------------------ Original Message ------------------
> From: "dev" <[email protected]>;
> Sent: Monday, May 12, 2025, 8:00 PM
> To: "dev" <[email protected]>;
>
> Subject: Re: Proposal: Add a 'RegexParseTransform' plugin in Apache
> SeaTunnel to parse irregular logs into structured logs
>
>
>
> Thanks Desheng! Could you create an issue to track it?
>
> 史德昇 <[email protected]> wrote on Mon, May 12, 2025 at 14:25:
>
> >
> > Dear Apache SeaTunnel community members,
> >
> > Hello!
> >
> > I am Shi Desheng, a big data engineer who has been following the
> > development of the SeaTunnel project for a long time. While using and
> > studying the project, I found that SeaTunnel cannot parse and store
> > complex data such as host logs, program run logs, and other irregular
> > log formats. This gap may lead potential users who need the feature to
> > give up on SeaTunnel, so I did some secondary development and built a
> > log-parsing RegexParseTransform plugin based on regular expressions,
> > which can parse irregular logs into structured logs. I would like to
> > contribute this feature to the open source community and work with
> > SeaTunnel community members to help the project become the world's top
> > open source data synchronization tool.
> >
> > *1. Background and Motivation*
> >
> > As data sources grow more complex and business requirements evolve
> > rapidly, general-purpose data integration frameworks face many
> > challenges in practice. In particular, SeaTunnel lacks general parsing
> > capabilities for raw logs whose formats are variable, irregular, or even
> > deeply nested (such as Apache/Nginx access logs, Linux host logs, system
> > syslogs, and custom program print logs). Yet this data is an
> > indispensable part of enterprise data governance and real-time
> > monitoring. Improving SeaTunnel's ability to parse irregular logs by
> > adding the *RegexParseTransform plugin* would therefore both expand its
> > application scenarios and make it more competitive in areas such as log
> > analysis and observability platforms.
> > ------------------------------
> > *2. Goals*
> >
> >    - Provide a *Transform plugin named RegexParseTransform* for parsing
> >      irregular logs into structured logs.
> >    - Support *parsing irregular logs*, such as:
> >       - Apache/Nginx access logs
> >       - Linux host logs
> >       - system syslogs
> >       - custom program print logs
> >    - Support *extracting multiple key fields*.
> >    - Support *retaining original logs*.
> >    - Support *parsing an entire irregular log* with a single
> >      configuration.
> >    - Compatible with both *BATCH* and *STREAMING* job modes.
> >    - Transparent integration with all connector-v2 pipelines (no need to
> >      modify source/sink plugins).
> >
> > ------------------------------
> > *3. Design Overview*
> >
> > *3.1 Plugin Configuration*
> >
> > transform {
> >   # Sample data
> >   # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> >   RegexParse {
> >     regex_parse_field = "value"
> >     regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"""
> >     groupMap = {
> >       what_log_orig:0         # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> >       device_ip:1             # 192.168.73.1
> >       operation_time:2        # Apr 13 16:27:33
> >       device_name:3           # asap91
> >       operation_command:4     # su
> >       how_op_res:5            # FAILED
> >       slave_account_name:6    # asap
> >     }
> >   }
> > }
> >
> >    - regex_parse_field: The upstream field to parse.
> >    - regex: The regular expression applied to that field.
> >    - groupMap: The mapping from result field names to regular-expression
> >      capture group indexes (index 0 is the entire match).
> >
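
The mapping between groupMap entries and capture groups can be illustrated with a small sketch. This is Python rather than the plugin's Java, and it is not the proposed implementation: `extract_fields` is a hypothetical helper, the use of `search` (first match anywhere in the line) is an assumption, and the dots in the IP part of the pattern are escaped so that they match only literal dots.

```python
import re

# Pattern and mapping taken from the sample configuration in the proposal.
# Assumption: IP dots are written as \. so they match literal dots only.
REGEX = re.compile(
    r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>"
    r"(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"
)
GROUP_MAP = {
    "what_log_orig": 0,       # group 0 is the entire match
    "device_ip": 1,
    "operation_time": 2,
    "device_name": 3,
    "operation_command": 4,
    "how_op_res": 5,
    "slave_account_name": 6,
}


def extract_fields(line, pattern=REGEX, group_map=GROUP_MAP):
    """Hypothetical helper: map each result field to its capture group.

    Returns None when the line does not match the pattern.
    """
    match = pattern.search(line)
    if match is None:
        return None
    return {name: match.group(index) for name, index in group_map.items()}


sample = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"
fields = extract_fields(sample)
for name, value in fields.items():
    print(f"{name} = {value}")
```

Because index 0 refers to the whole match, a single groupMap entry is enough to keep the original log text next to the extracted fields.
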
> > *3.2 Execution Logic*
> >
> >    - When starting a job, RegexParseTransform will:
> >       1. Read its configuration to obtain the values of the three core
> >          parameters and the index of the regex_parse_field field.
> >    - On receiving a record, RegexParseTransform will:
> >       1. Retrieve the value of regex_parse_field, then perform regular
> >          expression matching and logical verification.
> >       2. Traverse the groupMap entries, take each capture group index,
> >          and extract the corresponding captured group content from the
> >          matcher.
> >       3. Assemble the extracted values with the original values and
> >          pass them downstream.
> >    - When converting data structures, RegexParseTransform will:
> >       1. Traverse the groupMap entries and, by default, set the data
> >          type of all result fields to String (or convert them to their
> >          respective types based on their data format).
> >    - This transform is implemented by extending SeaTunnel's
> >      AbstractCatalogSupportTransform and will be fully parallel.
> >
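
The three stages above (start-up, per-record processing, schema conversion) can be sketched as follows. This is a language-neutral illustration in Python, not the plugin's Java code and not SeaTunnel's API: the class `RegexParseSketch` and its method names are hypothetical, rows are modeled as plain lists, and filling unmatched rows with nulls is an assumption, since the proposal does not say what its "logical verification" does on a mismatch.

```python
import re


class RegexParseSketch:
    """Hypothetical stand-in for the transform; not SeaTunnel's API."""

    def __init__(self, input_fields, regex_parse_field, regex, group_map):
        # Start-up: read the three parameters and resolve the index of the
        # field to parse (list.index raises ValueError if it is missing).
        self.input_fields = list(input_fields)
        self.pattern = re.compile(regex)
        self.group_map = dict(group_map)
        self.field_index = self.input_fields.index(regex_parse_field)

    def output_columns(self):
        # Schema conversion: original columns are kept and each result
        # field is appended; new fields default to the string type.
        new_columns = [(name, "string") for name in self.group_map]
        return [(name, "unchanged") for name in self.input_fields] + new_columns

    def transform(self, row):
        # Per record: match the configured field, pull out each mapped
        # capture group, and append the values to the original row.
        value = row[self.field_index]
        match = self.pattern.search(value) if isinstance(value, str) else None
        if match is None:
            # Assumption: unmatched rows are passed on with null results.
            extracted = [None] * len(self.group_map)
        else:
            extracted = [match.group(i) for i in self.group_map.values()]
        return list(row) + extracted


sketch = RegexParseSketch(
    input_fields=["id", "value"],
    regex_parse_field="value",
    regex=r"(\w+)=(\d+)",
    group_map={"metric": 1, "reading": 2},
)
matched_row = sketch.transform([7, "cpu=93"])
unmatched_row = sketch.transform([8, "garbage"])
print(sketch.output_columns())
print(matched_row)
print(unmatched_row)
```

Each record is handled independently of the others, which is consistent with the claim that the transform can run fully in parallel.
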
> > ------------------------------
> > *4. Implementation Plan*
> >
> > Task     Description
> > Phase 1  Support extracting multiple key fields with regular expressions.
> > Phase 2  Support converting extracted values to their respective types
> >          based on data format.
> > Phase 3  Add test coverage (unit + e2e) and documentation on the website.
> > *5. End*
> >
> > I sincerely appreciate the opportunity to contribute to the Apache
> > SeaTunnel open source project. The attachment is my proposal. Thank you
> > for taking the time to review it amidst your busy schedule. Looking
> > forward to your reply.
> >
> > Best regards,
> >
> > Shi Desheng
> >
