good suggestion.
Best Regards

---------------
David
Linkedin: https://www.linkedin.com/in/davidzollo
---------------

On Mon, May 12, 2025 at 8:59 PM 史德昇 <[email protected]> wrote:

> No problem, I will create this issue and continue to track and complete it.
>
> ------------------ Original Message ------------------
> From: "dev" <[email protected]>
> Sent: Monday, May 12, 2025, 8:00 PM
> To: "dev" <[email protected]>
> Subject: Re: Proposal: Add a 'RegexParseTransform' plugin in Apache
> SeaTunnel to parse irregular logs into structured logs
>
> Thanks Desheng! Could you create an issue to track it?
>
> 史德昇 <[email protected]> wrote on Mon, May 12, 2025 at 14:25:
>
> > Dear Apache SeaTunnel community members:
> >
> > Hello!
> >
> > I am Shi Desheng, a big data engineer who has been following the
> > development of the SeaTunnel project for a long time. While using and
> > studying the project, I found that SeaTunnel cannot parse and store
> > complex data such as host logs, program run logs, and other irregular
> > log formats. This may lead potential users who need this feature to
> > give up on SeaTunnel, so I did secondary development and built
> > RegexParseTransform, a log parsing plugin based on regular expressions
> > that parses irregular logs into structured logs. I would like to
> > contribute this feature to the open source community and work with
> > SeaTunnel community members to help the project become the world's top
> > open source data synchronization tool.
> >
> > *1. Background and Motivation*
> >
> > As data sources grow more complex and business requirements evolve
> > rapidly, general-purpose data integration frameworks often face many
> > challenges in real deployments.
> > In particular, SeaTunnel lacks general-purpose parsing capabilities
> > for raw logs with variable, irregular, or even deeply nested formats
> > (such as Apache/Nginx access logs, Linux host logs, system syslogs,
> > and custom program print logs). Yet this data is an indispensable part
> > of enterprise data governance and real-time monitoring. Adding the
> > *RegexParseTransform plugin* to improve SeaTunnel's ability to parse
> > irregular logs would therefore not only expand its application
> > scenarios but also strengthen its competitiveness in areas such as log
> > analysis and observability platforms.
> >
> > ------------------------------
> > *2. Goals*
> >
> > - Provide a *Transform plugin named RegexParseTransform* for parsing
> >   irregular logs into structured logs.
> > - Support *parsing irregular logs*, such as:
> >   - Apache/Nginx access logs
> >   - Linux host logs
> >   - system syslogs
> >   - custom program print logs
> > - Support *extracting multiple pieces of key information*.
> > - Support *retaining original logs*.
> > - Support *parsing an entire irregular log* with a single
> >   configuration.
> > - Compatible with both *BATCH* and *STREAMING* job modes.
> > - Transparent integration with all connector-v2 pipelines (no need to
> >   modify source/sink plugins).
> >
> > ------------------------------
> > *3. Design Overview*
> >
> > *3.1 Plugin Configuration*
> >
> > transform {
> >   # Sample data
> >   # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> >   RegexParse {
> >     regex_parse_field = "value"
> >     regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}\:\d{2}\:\d{2})\s+(\w+)\s+(\w+)\:\s+(\w+)\s+(\w+)"""
> >     groupMap = {
> >       what_log_orig: 0        # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
> >       device_ip: 1            # 192.168.73.1
> >       operation_time: 2       # Apr 13 16:27:33
> >       device_name: 3          # asap91
> >       operation_command: 4    # su
> >       how_op_res: 5           # FAILED
> >       slave_account_name: 6   # asap
> >     }
> >   }
> > }
> >
> > - regex_parse_field: The upstream field to parse.
> > - regex: The regular expression.
> > - groupMap: The mapping between result fields and regex capture group
> >   indexes.
> >
> > *3.2 Execution Logic*
> >
> > - When starting a job, RegexParseTransform will:
> >   1. Read the values of the three core parameters from its
> >      configuration and resolve the index of the regex_parse_field
> >      field.
> > - On receiving a record, RegexParseTransform will:
> >   1. Retrieve the value of regex_parse_field, run the regular
> >      expression match, and validate the result.
> >   2. Iterate over the groupMap entries, take each capture group index,
> >      and extract the corresponding captured group content from the
> >      regex matcher.
> >   3. Assemble the extracted values with the original values and pass
> >      them downstream.
> > - When converting data structures, RegexParseTransform will:
> >   1. Iterate over the groupMap entries and, by default, set the type of
> >      every result field to String (or convert them to their respective
> >      types based on their data format).
> > - This transformation is implemented by extending SeaTunnel's
> >   AbstractCatalogSupportTransform API and will be fully parallel.
> >
> > ------------------------------
> > *4. Implementation Plan*
> >
> > Task      Description
> > Phase 1   Support extracting multiple pieces of key information with
> >           regular expressions.
> > Phase 2   Support converting them to their respective types based on
> >           data format.
> > Phase 3   Add test coverage (unit + e2e) and documentation on the
> >           website.
> >
> > *5. End*
> >
> > I sincerely appreciate the opportunity to contribute to the Apache
> > SeaTunnel open source project. The attachment is my proposal. Thank
> > you for taking the time to review it amidst your busy schedule.
> > Looking forward to your reply.
> >
> > Best regards,
> >
> > Shi Desheng
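
For readers following the thread, the per-record logic described in section 3.2 can be sketched roughly as below. This is a minimal Python illustration, not the plugin's Java implementation. The function name `regex_parse`, the choice to return `None` when a line does not match, and the use of `search` rather than a full-string match are all assumptions made for the example. The pattern and group mapping come from the sample configuration in section 3.1, with the dots between the IP octets escaped so they match literal dots only.

```python
import re
from typing import Optional

# Pattern from the proposal's sample config (IP dots escaped here).
REGEX = re.compile(
    r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>"
    r"(\w+\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+):\s+(\w+)\s+(\w+)"
)

# Result field name -> capture group index, as in the sample groupMap.
# Group 0 is the whole matched text, which retains the original log.
GROUP_MAP = {
    "what_log_orig": 0,
    "device_ip": 1,
    "operation_time": 2,
    "device_name": 3,
    "operation_command": 4,
    "how_op_res": 5,
    "slave_account_name": 6,
}


def regex_parse(record: dict, field: str = "value") -> Optional[dict]:
    """Hypothetical helper: parse record[field] and append the mapped groups.

    Returns None when the field is missing or the regex does not match;
    the proposal does not specify this behaviour, so it is an assumption.
    """
    raw = record.get(field)
    if raw is None:
        return None
    match = REGEX.search(raw)
    if match is None:
        return None
    # All extracted values stay strings, matching the default described
    # in the "converting data structures" step.
    parsed = {name: match.group(index) for name, index in GROUP_MAP.items()}
    # Keep the upstream fields and add the extracted ones.
    return {**record, **parsed}


if __name__ == "__main__":
    line = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"
    print(regex_parse({"value": line}))
```

Keeping the upstream fields in the returned record mirrors step 3 of the per-record logic, where extracted values are assembled with the original values before being passed downstream.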
