Dear Apache SeaTunnel community members:
Hello!
I am Shi Desheng, a big data engineer who has followed the development of the
SeaTunnel project for a long time. While using and studying the project, I
found that SeaTunnel cannot parse and store data in complex scenarios such as
host logs, program run logs, and other irregular log formats. This gap may
drive away potential users who need the capability, so I did some secondary
development and built RegexParseTransform, a log-parsing Transform plugin based
on regular expressions that turns irregular logs into structured records. I
would like to contribute this feature to the open source community and work
together with SeaTunnel community members to help the project become a
world-class open source data synchronization tool.
1. Background and Motivation
As data sources grow more complex and business requirements evolve rapidly,
general-purpose data integration frameworks face many challenges in practice.
In particular, SeaTunnel lacks a universal parsing capability for raw logs with
variable, irregular, or even deeply nested formats (such as Apache/Nginx access
logs, Linux host logs, system syslogs, and custom program print logs), yet this
data is an indispensable part of enterprise data governance and real-time
monitoring. Improving SeaTunnel's ability to parse irregular logs by adding the
RegexParseTransform plugin would therefore not only expand its application
scenarios, but also strengthen its competitiveness in areas such as log
analysis and observability platforms.
2. Goals
Provide a Transform plugin named RegexParseTransform for parsing irregular
logs into structured logs.
Support for parsing irregular logs, such as:
Apache/Nginx access logs
Linux Host logs
system syslogs
custom program print logs
Support extracting multiple pieces of key information from a single log line.
Support retaining the original log line.
Support parsing an entire irregular log with a single configuration.
Compatible with both BATCH and STREAMING job modes.
Transparent integration with all connector-v2 pipelines (no need to modify
source/sink plugins).
3. Design Overview
3.1 Plugin Configuration
transform {
  # Example input:
  # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
  RegexParse {
    regex_parse_field = "value"
    regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+):\s+(\w+)\s+(\w+)"""
    groupMap = {
      what_log_orig: 0      # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
      device_ip: 1          # 192.168.73.1
      operation_time: 2     # Apr 13 16:27:33
      device_name: 3        # asap91
      operation_command: 4  # su
      how_op_res: 5         # FAILED
      slave_account_name: 6 # asap
    }
  }
}
regex_parse_field: the upstream field whose value will be parsed.
regex: the regular expression used to match the log line.
groupMap: the mapping between output field names and regex capture-group
indexes (group 0 is the entire match, i.e. the original log).
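To illustrate how the capture groups line up with groupMap (and why index 0
retains the original log), the standalone snippet below applies the same
regular expression using plain java.util.regex. It is only an illustration,
not the plugin code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupExample {
    public static void main(String[] args) {
        String line = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap";
        Pattern pattern = Pattern.compile(
                "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
                + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)");
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            // group(0) is the entire match -> what_log_orig
            System.out.println("what_log_orig      = " + matcher.group(0));
            // groups 1..6 map to the remaining groupMap entries
            System.out.println("device_ip          = " + matcher.group(1)); // 192.168.73.1
            System.out.println("operation_time     = " + matcher.group(2)); // Apr 13 16:27:33
            System.out.println("device_name        = " + matcher.group(3)); // asap91
            System.out.println("operation_command  = " + matcher.group(4)); // su
            System.out.println("how_op_res         = " + matcher.group(5)); // FAILED
            System.out.println("slave_account_name = " + matcher.group(6)); // asap
        }
    }
}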
3.2 Execution Logic
When starting a job, RegexParseTransform will:
Read the plugin configuration to obtain the values of the three core parameters
(regex_parse_field, regex, groupMap) and resolve the index of the
regex_parse_field field in the upstream row type.
On receiving a record, RegexParseTransform will:
Retrieve the value of the regex_parse_field column and run the regular
expression match against it, with basic validity checks.
Traverse the groupMap entries and, for each configured capture-group index,
extract the corresponding group content from the regex matcher.
Assemble the extracted values together with the original values and pass the
resulting record downstream (a minimal sketch of this per-record flow follows).
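Below is a minimal, self-contained sketch of this per-record flow in plain
Java. It is only a sketch: the real plugin operates on SeaTunnel row objects
rather than a bare Object[], and the upstream field names used here ("host",
"value") are made up for the illustration:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexParseRecordSketch {
    public static void main(String[] args) {
        // Configured values, matching the example in section 3.1.
        String regexParseField = "value";
        Pattern pattern = Pattern.compile(
                "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
                + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)");
        Map<String, Integer> groupMap = new LinkedHashMap<>();
        groupMap.put("what_log_orig", 0);
        groupMap.put("device_ip", 1);
        groupMap.put("operation_time", 2);
        groupMap.put("device_name", 3);
        groupMap.put("operation_command", 4);
        groupMap.put("how_op_res", 5);
        groupMap.put("slave_account_name", 6);

        // A stand-in for the upstream record: field names plus field values.
        String[] inputFieldNames = {"host", "value"};
        Object[] inputRow = {"asap91", "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"};

        // Job start: resolve the index of the field to parse (done once).
        int parseFieldIndex = Arrays.asList(inputFieldNames).indexOf(regexParseField);

        // Per record: match the configured regex against that field's value.
        Matcher matcher = pattern.matcher(String.valueOf(inputRow[parseFieldIndex]));
        boolean matched = matcher.find();

        // Assemble the output: original fields first, then one value per groupMap entry.
        Object[] outputRow = new Object[inputRow.length + groupMap.size()];
        System.arraycopy(inputRow, 0, outputRow, 0, inputRow.length);
        int next = inputRow.length;
        for (Integer groupIndex : groupMap.values()) {
            outputRow[next++] = matched ? matcher.group(groupIndex) : null;
        }
        System.out.println(Arrays.toString(outputRow));
    }
}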
When converting the data structure, RegexParseTransform will:
Traverse the groupMap entries and, by default, declare every result field as
String (or convert each field to its own type based on its data format).
The transform is implemented by extending SeaTunnel's
AbstractCatalogSupportTransform API and runs fully in parallel.
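On the schema side, a minimal sketch of how the output row type could be built
is shown below. It assumes SeaTunnel's SeaTunnelRowType, SeaTunnelDataType, and
BasicType classes; the exact hooks exposed by AbstractCatalogSupportTransform
may differ, so treat this as an outline rather than the actual implementation:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.seatunnel.api.table.type.BasicType;
import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
import org.apache.seatunnel.api.table.type.SeaTunnelRowType;

public class RegexParseSchemaSketch {

    // Builds the output row type: all upstream fields are kept, and every
    // groupMap key is appended as a STRING column by default.
    public static SeaTunnelRowType buildOutputRowType(
            SeaTunnelRowType inputRowType, List<String> groupMapFields) {
        List<String> fieldNames = new ArrayList<>(Arrays.asList(inputRowType.getFieldNames()));
        List<SeaTunnelDataType<?>> fieldTypes =
                new ArrayList<>(Arrays.asList(inputRowType.getFieldTypes()));
        for (String field : groupMapFields) {
            fieldNames.add(field);
            fieldTypes.add(BasicType.STRING_TYPE);
        }
        return new SeaTunnelRowType(
                fieldNames.toArray(new String[0]),
                fieldTypes.toArray(new SeaTunnelDataType<?>[0]));
    }
}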
4. Implementation Plan
Task     Description
Phase 1  Support extracting multiple pieces of key information via regular-expression capture groups.
Phase 2  Support converting extracted fields to their respective types based on data format.
Phase 3  Add test coverage (unit + e2e) and documentation on the website.
5. End
I sincerely appreciate the opportunity to contribute to the Apache SeaTunnel
open source project. The full proposal is attached. Thank you for taking the
time to review it; I look forward to your reply.
Best regards,
Shi Desheng