Dear Apache SeaTunnel community members:
Hello!
I am Shi Desheng, a big data engineer who has followed the development of the
SeaTunnel project for a long time. While using and studying the project, I
found that SeaTunnel cannot parse and store data in complex scenarios such as
host logs, program run logs, and other irregular log formats. This gap may
drive away potential users who need the capability, so I did some secondary
development and built RegexParseTransform, a log-parsing Transform plugin based
on regular expressions that turns irregular logs into structured records. I
would like to contribute this feature to the open source community and work
together with SeaTunnel community members to help the project become a
world-class open source data synchronization tool.
1. Background and Motivation
As data sources grow more complex and business requirements evolve rapidly,
general-purpose data integration frameworks face many challenges in practice.
In particular, SeaTunnel lacks a universal parsing capability for raw logs with
variable, irregular, or even deeply nested formats (such as Apache/Nginx access
logs, Linux host logs, system syslogs, and custom program print logs), yet this
data is an indispensable part of enterprise data governance and real-time
monitoring. Improving SeaTunnel's ability to parse irregular logs by adding the
RegexParseTransform plugin would therefore not only expand its application
scenarios, but also strengthen its competitiveness in areas such as log
analysis and observability platforms.
2. Goals
Provide a Transform plugin named RegexParseTransform for parsing irregular
logs into structured logs.
Support for parsing irregular logs, such as:
Apache/Nginx access logs
Linux Host logs
system syslogs
custom program print logs
Support extracting multiple pieces of key information from a single log line.
Support retaining the original log line.
Support parsing an entire irregular log with a single configuration.
Compatible with both BATCH and STREAMING job modes.
Transparent integration with all connector-v2 pipelines (no need to modify
source/sink plugins).
3. Design Overview
3.1 Plugin Configuration
transform {
  # Example input:
  # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
  RegexParse {
    regex_parse_field = "value"
    regex = """(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})~<\d+>(\w+\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\w+)\s+(\w+):\s+(\w+)\s+(\w+)"""
    groupMap = {
      what_log_orig: 0      # 192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap
      device_ip: 1          # 192.168.73.1
      operation_time: 2     # Apr 13 16:27:33
      device_name: 3        # asap91
      operation_command: 4  # su
      how_op_res: 5         # FAILED
      slave_account_name: 6 # asap
    }
  }
}
regex_parse_field: the upstream field whose value will be parsed.
regex: the regular expression used to match the log line.
groupMap: the mapping between output field names and regex capture-group
indexes (group 0 is the entire match, i.e. the original log).
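To illustrate how the capture groups line up with groupMap (and why index 0
retains the original log), the standalone snippet below applies the same
regular expression using plain java.util.regex. It is only an illustration,
not the plugin code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupExample {
    public static void main(String[] args) {
        String line = "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap";
        Pattern pattern = Pattern.compile(
                "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
                + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)");
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            // group(0) is the entire match -> what_log_orig
            System.out.println("what_log_orig      = " + matcher.group(0));
            // groups 1..6 map to the remaining groupMap entries
            System.out.println("device_ip          = " + matcher.group(1)); // 192.168.73.1
            System.out.println("operation_time     = " + matcher.group(2)); // Apr 13 16:27:33
            System.out.println("device_name        = " + matcher.group(3)); // asap91
            System.out.println("operation_command  = " + matcher.group(4)); // su
            System.out.println("how_op_res         = " + matcher.group(5)); // FAILED
            System.out.println("slave_account_name = " + matcher.group(6)); // asap
        }
    }
}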
3.2 Execution Logic
When starting a job, RegexParseTransform will:
Read the plugin configuration to obtain the values of the three core parameters
(regex_parse_field, regex, groupMap) and resolve the index of the
regex_parse_field field in the upstream row type.
On receiving a record, RegexParseTransform will:
Retrieve the value of the regex_parse_field column and run the regular
expression match against it, with basic validity checks.
Traverse the groupMap entries and, for each configured capture-group index,
extract the corresponding group content from the regex matcher.
Assemble the extracted values together with the original values and pass the
resulting record downstream (a minimal sketch of this per-record flow follows).
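Below is a minimal, self-contained sketch of this per-record flow in plain
Java. It is only a sketch: the real plugin operates on SeaTunnel row objects
rather than a bare Object[], and the upstream field names used here ("host",
"value") are made up for the illustration:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexParseRecordSketch {
    public static void main(String[] args) {
        // Configured values, matching the example in section 3.1.
        String regexParseField = "value";
        Pattern pattern = Pattern.compile(
                "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})~<\\d+>"
                + "(\\w+\\s+\\d{1,2}\\s+\\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\w+):\\s+(\\w+)\\s+(\\w+)");
        Map<String, Integer> groupMap = new LinkedHashMap<>();
        groupMap.put("what_log_orig", 0);
        groupMap.put("device_ip", 1);
        groupMap.put("operation_time", 2);
        groupMap.put("device_name", 3);
        groupMap.put("operation_command", 4);
        groupMap.put("how_op_res", 5);
        groupMap.put("slave_account_name", 6);

        // A stand-in for the upstream record: field names plus field values.
        String[] inputFieldNames = {"host", "value"};
        Object[] inputRow = {"asap91", "192.168.73.1~<37>Apr 13 16:27:33 asap91 su: FAILED asap"};

        // Job start: resolve the index of the field to parse (done once).
        int parseFieldIndex = Arrays.asList(inputFieldNames).indexOf(regexParseField);

        // Per record: match the configured regex against that field's value.
        Matcher matcher = pattern.matcher(String.valueOf(inputRow[parseFieldIndex]));
        boolean matched = matcher.find();

        // Assemble the output: original fields first, then one value per groupMap entry.
        Object[] outputRow = new Object[inputRow.length + groupMap.size()];
        System.arraycopy(inputRow, 0, outputRow, 0, inputRow.length);
        int next = inputRow.length;
        for (Integer groupIndex : groupMap.values()) {
            outputRow[next++] = matched ? matcher.group(groupIndex) : null;
        }
        System.out.println(Arrays.toString(outputRow));
    }
}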
When converting the data structure, RegexParseTransform will:
Traverse the groupMap entries and, by default, declare every result field as
String (or convert each field to its own type based on its data format).
The transform is implemented by extending SeaTunnel's
AbstractCatalogSupportTransform API and runs fully in parallel.
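On the schema side, a minimal sketch of how the output row type could be built
is shown below. It assumes SeaTunnel's SeaTunnelRowType, SeaTunnelDataType, and
BasicType classes; the exact hooks exposed by AbstractCatalogSupportTransform
may differ, so treat this as an outline rather than the actual implementation:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.seatunnel.api.table.type.BasicType;
import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
import org.apache.seatunnel.api.table.type.SeaTunnelRowType;

public class RegexParseSchemaSketch {

    // Builds the output row type: all upstream fields are kept, and every
    // groupMap key is appended as a STRING column by default.
    public static SeaTunnelRowType buildOutputRowType(
            SeaTunnelRowType inputRowType, List<String> groupMapFields) {
        List<String> fieldNames = new ArrayList<>(Arrays.asList(inputRowType.getFieldNames()));
        List<SeaTunnelDataType<?>> fieldTypes =
                new ArrayList<>(Arrays.asList(inputRowType.getFieldTypes()));
        for (String field : groupMapFields) {
            fieldNames.add(field);
            fieldTypes.add(BasicType.STRING_TYPE);
        }
        return new SeaTunnelRowType(
                fieldNames.toArray(new String[0]),
                fieldTypes.toArray(new SeaTunnelDataType<?>[0]));
    }
}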
4. Implementation Plan
Task     Description
Phase 1  Support extracting multiple pieces of key information via regular-expression capture groups.
Phase 2  Support converting extracted fields to their respective types based on data format.
Phase 3  Add test coverage (unit + e2e) and documentation on the website.
5. End
I sincerely appreciate the opportunity to contribute to the Apache SeaTunnel
open source project. The full proposal is attached. Thank you for taking the
time to review it; I look forward to your reply.
Best regards,
Shi Desheng