This is an automated email from the ASF dual-hosted git repository.

nielifeng pushed a commit to branch main
in repository 
https://gitbox.apache.org/repos/asf/incubator-seatunnel-website.git


The following commit(s) were added to refs/heads/main by this push:
     new cc9b32326 add (#145)
cc9b32326 is described below

commit cc9b3232626cb7fc43367dea116f5b84f6c1b5bd
Author: lifeng <[email protected]>
AuthorDate: Mon Sep 19 16:15:57 2022 +0800

    add (#145)
---
 blog/2022-09-12-SeaTunnel-2.1.3-released.md        |  60 +++++
 ...tributors | Why-do I contribute-to-SeaTunnel.md |  40 ++++
 ...-for-SeaTunnel-Connector-Development-Process.md | 242 +++++++++++++++++++++
 3 files changed, 342 insertions(+)

diff --git a/blog/2022-09-12-SeaTunnel-2.1.3-released.md 
b/blog/2022-09-12-SeaTunnel-2.1.3-released.md
new file mode 100644
index 000000000..97dd98bd0
--- /dev/null
+++ b/blog/2022-09-12-SeaTunnel-2.1.3-released.md
@@ -0,0 +1,60 @@
+# SeaTunnel 2.1.3 released! Introducing the Assert Sink connector and the NullRate and Nulltf Transforms
+
+![](https://miro.medium.com/max/1400/1*7jtTFNpvwC6nquA-BLfqGg.png)
+
+More than a month after the release of Apache SeaTunnel(Incubating) 2.1.2, we have been collecting user and developer feedback to bring you version 2.1.3. The new version introduces the Assert Sink connector, which the community urgently needed, and two Transforms, NullRate and Nulltf. Some usability problems in the previous version have also been fixed, improving stability and efficiency.
+
+This article will introduce the details of the update of Apache 
SeaTunnel(Incubating) **version 2.1.3**.
+
+* Release Note: 
[https://github.com/apache/incubator-seatunnel/blob/2.1.3/release-note.md](https://github.com/apache/incubator-seatunnel/blob/2.1.3/release-note.md)
+* Download address: 
[https://seatunnel.apache.org/download](https://seatunnel.apache.org/download)
+
+## Major feature updates
+### Introduces Assert Sink connector
+The Assert Sink connector is introduced in SeaTunnel version 2.1.3 to verify data correctness. Special thanks to Lhyundeadsoul for his contribution.
+
+### Add two Transforms
+In addition, version 2.1.3 adds two Transforms, NullRate and Nulltf, which are used to detect data quality and to replace null values in the data with generated default values. These two Transforms can effectively improve the availability of data and reduce the frequency of abnormal situations. Special thanks to wsyhj and Interest1-wyt for their contributions.
+
+At present, SeaTunnel supports 9 types of Transforms, including Common Options, Json, NullRate, Nulltf, Replace, Split, SQL, UDF, and UUID, and the community is welcome to contribute more Transform types.
+
+For details of Transform, please refer to the official documentation: 
[https://seatunnel.apache.org/docs/2.1.3/category/transform](https://seatunnel.apache.org/docs/2.1.3/category/transform)
+
+### ClickhouseFile connector now supports the Rsync data transfer method
+SeaTunnel version 2.1.3 also brings Rsync data transfer mode support to the ClickhouseFile connector; users can now choose between the SCP and Rsync data transfer modes. Thanks to Emor-nj for contributing to this feature.
+
+### Specific feature updates:
+
+* Flink Fake data supports BigInteger type 
https://github.com/apache/incubator-seatunnel/pull/2118
+* Add Flink Assert Sink connector 
https://github.com/apache/incubator-seatunnel/pull/2022
+* Spark ClickhouseFile connector supports Rsync data file transfer method 
https://github.com/apache/incubator-seatunnel/pull/2074
+* Add Flink Assert Sink e2e module 
https://github.com/apache/incubator-seatunnel/pull/2036
+* Add NullRate Transform for detecting data quality 
https://github.com/apache/incubator-seatunnel/pull/1978
+* Add Nulltf Transform for setting defaults 
https://github.com/apache/incubator-seatunnel/pull/1958
+### Optimization
+* Refactored Spark TiDB-related parameter information
+* Refactor the code to remove redundant code warning information
+* Optimize connector jar package loading logic
+* Add Plugin Discovery module
+* Add documentation for some modules
+* Upgrade common-collection from version 4 to 4.4
+* Upgrade common-codec version to 1.13
+### Bug Fix
+In addition, in response to feedback from users of version 2.1.2, we also fixed some usability issues, such as the inability to use the same component as both Source and Sink, and further improved stability.
+
+* Fixed the problem of Hudi Source loading twice
+* Fix the problem that the TwoPhaseCommit field is not recognized since Doris 0.15
+* Fixed abnormal data output when accessing Hive using Spark JDBC
+* Fix JDBC data loss when partition_column (partition mode) is set
+* Fix KafkaTableStream schema JSON parsing error
+* Fix Shell script getting APP_DIR path error
+* Updated Flink RunMode enumeration to get correct help messages for run modes
+* Fix the same source and sink registered connector cache error
+* Fix command line parameter -t (--check) conflict with the Flink deployment target parameter
+* Fix Jackson type conversion error problem
+* Fix the problem of failure to run scripts in paths other than SeaTunnel_Home
+### Acknowledgment
+Thanks to all the contributors (GitHub IDs, in no particular order); it is your efforts that fuel the launch of this version, and we look forward to more contributions to the Apache SeaTunnel(Incubating) community!
+
+`leo65535, CalvinKirs, mans2singh, ashulin, wanghuan2054, lhyundeadsoul, 
tobezhou33, Hisoka-X, ic4y, wsyhj, Emor-nj, gleiyu, smallhibiscus, Bingz2, 
kezhenxu94, youyangkou, immustard, Interest1-wyt, superzhang0929, gaaraG, 
runwenjun`
+
diff --git a/blog/2022-09-14-Talk -With-Overseas-contributors | Why-do I 
contribute-to-SeaTunnel.md b/blog/2022-09-14-Talk -With-Overseas-contributors | 
Why-do I contribute-to-SeaTunnel.md
new file mode 100644
index 000000000..029ebb59e
--- /dev/null
+++ b/blog/2022-09-14-Talk -With-Overseas-contributors | Why-do I 
contribute-to-SeaTunnel.md    
@@ -0,0 +1,40 @@
+# Talk With Overseas contributors | Why do I contribute to SeaTunnel?
+
+As SeaTunnel gets popular around the world, it is attracting more and more contributors from overseas to join the open-source cause. Among them, Namgung Chan, a big data platform engineer at Kakao Enterprise Corp., recently contributed the Neo4j Sink Connector to SeaTunnel. We had a talk with him to learn why SeaTunnel is attractive to him and how he thinks SeaTunnel could gain popularity in the South Korean market.
+
+## Personal Profile
+![](https://miro.medium.com/max/1400/1*sKzXjqu6M_VmoperNBYUGQ.jpeg)
+
+Namgung Chan, South Korea, Big Data Platform Engineer at Kakao Enterprise Corp.
+
+Blog (written in Korean): https://getchan.github.io/
+GitHub ID: https://github.com/getChan
+LinkedIn: https://www.linkedin.com/in/namgung-chan-6a06441b6/
+
+### Contributions to the community
+He wrote the Neo4j Sink Connector for the new SeaTunnel Connector API.
+
+### How did he first get to know SeaTunnel?
+It was the first time for Namgung Chan to engage in open source. He wanted to learn technical skills by contributing and, at the same time, to experience the open-source culture.
+
+For him, an open-source project that is written in Java, made for data engineering, and has many ‘help wanted’ or ‘good first issue’ issues was quite suitable. Then he found SeaTunnel on the Apache Software Foundation project webpage.
+
+### The first impression of SeaTunnel Community
+Though it was his first open-source experience, he felt it was comfortable and interesting to engage with the community. He also felt very welcome, because there are many ‘good first issue’ and ‘volunteer wanted’ tagged issues, and code reviews get a quick response.
+
+As he gained knowledge of Neo4j, he grew much more confident in open-source contribution.
+
+### Research and comparison
+Before knowing about SeaTunnel, Namgung Chan used Spring Cloud Data Flow for data integration. After experiencing SeaTunnel, however, he thinks the latter is more lightweight than SCDF, because in SCDF every source, processor, and sink component is an individual application, while in SeaTunnel this is not the case.
+
+Though he hasn’t used SeaTunnel in his working environment yet, Namgung Chan said he would gladly use it when the need arises, especially for data integration across various data stores.
+
+
+### Expectations for SeaTunnel
+The most exciting new features or optimizations for Namgung Chan are:
+
+* Data integration for various data storage
+* Strict data validation
+* Monitoring extension
+* Low computing resource usage
+* Exactly-once data processing
+
+In the future, Namgung Chan plans to keep contributing from light issues to 
heavy ones, and we hope he will have a good time here!
\ No newline at end of file
diff --git 
a/blog/2022-09-19-Code-Demo-for-SeaTunnel-Connector-Development-Process.md 
b/blog/2022-09-19-Code-Demo-for-SeaTunnel-Connector-Development-Process.md
new file mode 100644
index 000000000..1025e1a6f
--- /dev/null
+++ b/blog/2022-09-19-Code-Demo-for-SeaTunnel-Connector-Development-Process.md
@@ -0,0 +1,242 @@
+# Code Demo for SeaTunnel Connector Development Process
+At the Apache SeaTunnel & Apache Doris Joint Meetup held on July 24, Liu Li, senior engineer at WhaleOps and contributor to Apache SeaTunnel, presented an easy way to quickly develop a connector for SeaTunnel.
+
+![](https://miro.medium.com/max/700/1*Rbd5BrSuGiZUQA53DXZrBw.png)
+We’ll divide it into four key parts:
+
+● The definition of a Connector
+
+● How to access data sources and targets
+
+● Code to demonstrate how to implement a Connector
+
+● Sources and targets that are currently supported
+
+## Definition of a Connector
+The Connector consists of Source and Sink and is a concrete implementation of 
accessing data sources.
+
+Source: The Source is responsible for reading data from sources such as 
MySQLSource, DorisSource, HDFSSource, TXTSource, and more.
+
+Sink: The Sink is responsible for writing the read data to the target, including MySQLSink, ClickHouseSink, HudiSink, and more. Data transfer, or more specifically data synchronization, is completed through the cooperation of the Source and Sink.
+
+![](https://miro.medium.com/max/298/1*hsfa9Xtzt7o028XjCpqoOg.png)
+
+Of course, different sources and sinks can cooperate with each other.
+
+For example, you can use MySQL Source, and Doris Sink to synchronize data from 
MySQL to Doris, or even read data from MySQL Source and write to HDFS Sink.
+
+## How to access data sources and targets
+
+### How to access Source
+First, let’s take a look at how to access the Source: how to implement a source, and the core interfaces that need to be implemented to do so.
+
+The simplest Source is a single-parallelism Source. For such a simple source that does not support state storage or other advanced functions, what interfaces should we implement?
+
+First, we need to implement getBoundedness in the Source to identify whether the Source supports real-time (unbounded) mode, offline (bounded) mode, or both.
+
+createReader creates a Reader, whose main function is the concrete reading of data. A single-parallelism source is really simple, as we only need to implement one method, pollNext, through which the read data is emitted.
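+As a rough illustration of this shape, here is a minimal sketch with simplified stand-in interfaces. The names below (Collector, SourceReader, Source, ListSource) only mirror the methods described above; the real SeaTunnel Connector API has different signatures and generics, so treat this as an assumption-laden sketch, not the actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in interfaces that mirror getBoundedness, createReader,
// and pollNext as described above; not the real SeaTunnel API.
interface Collector<T> { void collect(T record); }

interface SourceReader<T> {
    boolean hasNext();
    void pollNext(Collector<T> output);   // emit the data that was read
}

interface Source<T> {
    String getBoundedness();              // "BOUNDED" (offline) or "UNBOUNDED" (real-time)
    SourceReader<T> createReader();       // single parallelism: one reader
}

// A toy bounded source that "reads" records from an in-memory list.
class ListSource implements Source<String> {
    private final List<String> data;

    ListSource(List<String> data) { this.data = data; }

    @Override public String getBoundedness() { return "BOUNDED"; }

    @Override public SourceReader<String> createReader() {
        final Iterator<String> it = data.iterator();
        return new SourceReader<String>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public void pollNext(Collector<String> output) {
                if (it.hasNext()) output.collect(it.next());
            }
        };
    }
}
```

The engine-side loop then just calls pollNext until the bounded source is exhausted.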
+
+If concurrent reading is required, what additional interfaces should we 
implement?
+![](https://miro.medium.com/max/393/1*bRxRjyMOGkVqseQkg0ONWg.png)
+
+For concurrent reading, we’ll introduce a new member, called the Enumerator.
+
+We implement createEnumerator in Source, and the main function of this member 
is to create an Enumerator to split the task into segments and then send it to 
the Reader.
+
+For example, a task can be divided into 4 splits.
+
+If the parallelism is two, there will be two Readers: two of the four splits will be sent to Reader1, and the other two will be sent to Reader2.
+
+If the parallelism is higher, say four, then four Readers are created, and the four splits are read concurrently for improved efficiency.
+
+A corresponding method in the Enumerator, addSplitsBack, sends the splits to the corresponding Reader; the ID of the target Reader can be specified through this method.
+
+Similarly, the Reader has a method called addSplits to receive the splits sent by the Enumerator for data reading.
+
+In a nutshell, for concurrent reading, we need an Enumerator to implement task 
splitting and send the splits to the reader. Also, the reader receives the 
splits and uses them for reading.
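+The split-and-assign step can be sketched as follows. This is a hypothetical, simplified enumerator (the Split and SplitEnumerator classes below are illustrative stand-ins, not the real SeaTunnel API): it cuts a range of work into splits and distributes them round-robin over reader IDs, mimicking the 4-splits/2-readers example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a split describes one chunk of work, e.g. an id range.
class Split {
    final int start;   // inclusive
    final int end;     // exclusive
    Split(int start, int end) { this.start = start; this.end = end; }
}

// Simplified enumerator: cuts the full range into splits and assigns them
// round-robin to reader IDs, mimicking the addSplitsBack(splits, readerId)
// behavior described above.
class SplitEnumerator {
    private final Map<Integer, List<Split>> assignment = new HashMap<>();

    // Divide [0, total) into splitCount splits and distribute them
    // over `parallelism` readers.
    void run(int total, int splitCount, int parallelism) {
        int size = total / splitCount;
        for (int i = 0; i < splitCount; i++) {
            int end = (i == splitCount - 1) ? total : (i + 1) * size;
            Split split = new Split(i * size, end);
            int readerId = i % parallelism;   // which reader receives this split
            assignment.computeIfAbsent(readerId, k -> new ArrayList<>()).add(split);
        }
    }

    List<Split> splitsFor(int readerId) {
        return assignment.getOrDefault(readerId, Collections.emptyList());
    }
}
```

With 4 splits and a parallelism of 2, each reader ends up with two splits, exactly as in the example above.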
+
+In addition, if we need to support resuming and exactly-once semantics, what 
additional interfaces should we implement?
+
+If the goal is to resume the transfer from a breakpoint, we must save the 
state and restore it. For this, we need to implement a restoreEnumerator in 
Source.
+
+The restoreEnumerator method is used to restore an Enumerator through the 
state and restore the split.
+
+Correspondingly, we need to implement a snapshotState in this enumerator, 
which is used to save the state of the current Enumerator and perform failure 
recovery during checkpoints.
+
+At the same time, the Reader will also have a snapshotState method to save the 
split state of the Reader.
+
+In the event of a failed restart, the Enumerator can be restored through the 
saved state. After the split is restored, reading can be continued from the 
place of failure, including fetching and incoming data.
+
+Exactly-once semantics requires the source to support data replay, as Kafka, Pulsar, and others do, and the sink to commit in two phases; exactly-once semantics is achieved through the cooperation of such a source and sink.
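+The snapshot-and-restore cycle described above can be sketched like this. Again, this is a simplified, hypothetical model (EnumeratorState and RestorableEnumerator are illustrative names): the enumerator snapshots the splits it has not yet handed out, and after a failure a new enumerator is rebuilt from that state so reading continues from the point of failure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch mirroring snapshotState/restoreEnumerator as
// described above; not the real SeaTunnel API.
class EnumeratorState {
    final List<Integer> pendingSplits;   // splits not yet assigned to readers
    EnumeratorState(List<Integer> pendingSplits) {
        this.pendingSplits = new ArrayList<>(pendingSplits);
    }
}

class RestorableEnumerator {
    private final Deque<Integer> pending;

    RestorableEnumerator(Collection<Integer> splits) {
        this.pending = new ArrayDeque<>(splits);
    }

    // Hand the next split to a reader.
    Integer nextSplit() { return pending.poll(); }

    // Called at a checkpoint: save everything still unassigned.
    EnumeratorState snapshotState() {
        return new EnumeratorState(new ArrayList<>(pending));
    }

    // After a failed restart, rebuild the enumerator from the last
    // snapshot so reading continues from where it left off.
    static RestorableEnumerator restore(EnumeratorState state) {
        return new RestorableEnumerator(state.pendingSplits);
    }
}
```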
+
+### How to access Sink
+
+Now, let’s take a look at how to connect to the Sink. What interfaces does the 
Sink need to implement?
+
+Truth be told, the Sink is relatively simple when state storage and two-phase commit are not needed.
+
+Note that the Sink does not distinguish between stream synchronization and batch synchronization, as the Sink, and the entire SeaTunnel API system, supports **Unified Stream and Batch Processing.**
+
+Firstly, we need to implement createWriter. A Writer is used for data writing.
+
+You need to implement a write method in the Writer, through which data is written to the target library.
+
+![](https://miro.medium.com/max/414/1*xQ7DRHdBGv-ofjSYdSHAoA.png)
+
+As shown in the figure above, if the parallelism is set to two, the engine will call the createWriter method twice to generate two Writers. The engine will feed data to these two Writers, which write the data to the target through the write method.
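+The basic createWriter/write pair can be sketched as below, with simplified stand-in interfaces (SinkWriter, Sink, and the toy BufferSink are illustrative only, not the real SeaTunnel API). The toy sink writes records into an in-memory buffer so the data flow is easy to observe.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in interfaces mirroring createWriter/write as
// described above; the real SeaTunnel API differs.
interface SinkWriter<T> {
    void write(T element);            // write one record to the target
}

interface Sink<T> {
    SinkWriter<T> createWriter();     // called once per parallel writer
}

// A toy sink that "writes" records into an in-memory target.
class BufferSink implements Sink<String> {
    final List<String> target = new ArrayList<>();

    @Override public SinkWriter<String> createWriter() {
        return element -> target.add(element);
    }
}
```

With a parallelism of two, the engine would call createWriter twice and feed each resulting Writer its share of the data.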
+
+Now consider a more advanced setup in which we need to support **two-phase commit and state storage.**
+
+Here, what additional interfaces should we implement?
+
+First, let’s introduce a new member, the Committer, whose main role is the second-phase commit.
+
+![](https://miro.medium.com/max/414/1*cvj1i2A-E-1c_bCZneshtg.png)
+
+Since the Sink stores state, the Writer needs to be restorable from that state; hence, restoreWriter should be implemented.
+
+Also, since we have introduced a new member, the Committer, we should implement createCommitter in the Sink. We can then use this method to create a Committer for the second-phase commit or rollback.
+
+In this case, what additional interfaces does Writer need to implement?
+
+Since it is a two-phase commit, the first-phase commit is done in the Writer through the prepareCommit method.
+
+In addition, to support state storage and failure recovery, we need snapshotState to take snapshots at checkpoints, saving the state for failure recovery scenarios.
+
+The Committer is the core here. It is mainly used for rollback and commit 
operations in the second phase.
+
+The corresponding process is as follows: the Writer writes data to the database, and at a checkpoint the engine triggers the first-phase commit, so the Writer prepares a commit.
+
+The Writer then returns commitInfo to the engine, and the engine judges whether the first-phase commits of all Writers were successful.
+
+If they are indeed successful, the engine will use the commit method to 
actually commit.
+
+For MySQL, the first-phase commit just saves a transaction ID and sends it to the engine; based on that transaction ID, the commit is either completed or rolled back.
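+The whole two-phase flow can be sketched as follows. This is a hypothetical in-memory model (TxnStore, TwoPhaseWriter, and TwoPhaseCommitter are illustrative names, not the real SeaTunnel API): prepareCommit stages the buffered data under a transaction ID and returns that ID as commit info, and the Committer then either commits or aborts, just as the engine does after checking every Writer's first phase.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// A toy "database": staged (pre-committed) data per transaction, plus
// the data that has actually been committed and is visible.
class TxnStore {
    final Map<String, List<String>> staged = new HashMap<>();
    final List<String> committed = new ArrayList<>();
}

class TwoPhaseWriter {
    private final TxnStore store;
    private final String txnId = UUID.randomUUID().toString();
    private final List<String> buffer = new ArrayList<>();

    TwoPhaseWriter(TxnStore store) { this.store = store; }

    void write(String element) { buffer.add(element); }

    // First phase: stage the buffered data under a transaction ID and
    // hand that ID back to the engine as commit info.
    String prepareCommit() {
        store.staged.put(txnId, new ArrayList<>(buffer));
        buffer.clear();
        return txnId;
    }
}

class TwoPhaseCommitter {
    private final TxnStore store;
    TwoPhaseCommitter(TxnStore store) { this.store = store; }

    // Second phase: make the staged data visible.
    void commit(String txnId) { store.committed.addAll(store.staged.remove(txnId)); }

    // Roll back: drop the staged data.
    void abort(String txnId) { store.staged.remove(txnId); }
}
```

If any Writer's prepareCommit fails, the engine would call abort with the collected transaction IDs instead of commit.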
+
+## How to implement the Connector
+We’ve taken a look at Source and Sink; let’s now look at how to access the 
data source and implement your own Connector.
+
+Firstly, we need to build a development environment for the Connector.
+
+### The necessary environment
+1. Java 1.8 or 11, Maven, IntelliJ IDEA
+
+2. Windows users additionally need to download Git Bash (https://gitforwindows.org/)
+
+3. Once you have these, you can download the SeaTunnel source code by cloning the Git repository.
+
+4. Download the SeaTunnel source code: `git clone https://github.com/apache/incubator-seatunnel.git`, then `cd incubator-seatunnel`
+
+### SeaTunnel Engineering Structure
+We then open the project in the IDE and see the directory structure shown in the figure:
+
+![](https://miro.medium.com/max/700/1*utRhNAsYiqQqBFa4Tjewgw.png)
+
+The directory is divided into several parts:
+
+1. connector-v2
+
+The concrete implementations of the new connectors (connector-v2) are placed in this module.
+
+2. connector-v2-dist
+
+The translation layer for the new connectors, which translates them into implementations for specific engines such as Spark, Flink, and ST-Engine, so that connectors do not have to be implemented separately under each engine. ST-Engine is the “important, big project” the community is striving to implement. This project is worth the wait.
+
+3. examples
+
+This package provides a single-machine local operation method, which is 
convenient for debugging while implementing the Connector.
+
+4. e2e
+
+The e2e package is for e2e testing of the Connector.
+
+Next, let’s check out how a Connector can be created (based on the new 
Connector). Here is the step-by-step process:
+
+1. Create a new module in the seatunnel-connectors-v2 directory and name it 
this way: connector-{connector name}.
+
+2. The pom file can refer to the pom file of an existing connector; also add the new child module to the parent pom file.
+
+3. Create two new packages corresponding to the packages of Source and Sink, 
respectively:
+
+a. org.apache.seatunnel.connectors.seatunnel.{connector name}.source
+
+b. org.apache.seatunnel.connectors.seatunnel.{connector name}.sink
+
+Take the mysocket example shown in the figure:
+
+![](https://miro.medium.com/max/700/1*K1btD2gNwYxj96OJnPfW2Q.png)
+
+Then implement the connector. During implementation, you can use the example module for local debugging if needed; this module mainly provides the local running environments of Flink and Spark.
+
+![](https://miro.medium.com/max/700/1*qOc3q7okzo7jObHxloc7WQ.png)
+
+As you can see in the image, there are numerous examples under the “Example” 
module — including seatunnel-flink-connector-v2-example.
+
+So how do you use them?
+
+Let’s take an example. The debugging steps on Flink are as follows (these actions are under the seatunnel-flink-connector-v2-example module):
+
+1. Add connector dependencies in pom.xml
+
+2. Add the task configuration file under resources/examples
+
+3. Configure the file in the SeaTunnelApiExample main method
+
+4. Run the main method
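+For step 2, the task configuration file is a plain-text file in SeaTunnel's config format. As a rough, hypothetical sketch only (connector and option names vary by version, so check the documentation for your release), such a file might look like:

```
env {
  # run the job with a single parallel task while debugging
  execution.parallelism = 1
}

source {
  # a built-in test source that generates fake rows
  FakeSource {
    result_table_name = "fake"
  }
}

sink {
  # print the rows to stdout for local debugging
  Console {}
}
```

Pointing the SeaTunnelApiExample main method at this file and running it then exercises the connector locally.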
+
+### Code Demo
+
+This code demonstration is based on DingTalk.
+
+Here’s a reference (19:35–37:10):
+
+https://weixin.qq.com/sph/A1ri7B
+
+![](https://miro.medium.com/max/700/1*ej9ronizPtC09ILWJDlbUg.png)
+
+### New Connectors supported at this stage
+
+As of July 14, the statistics of the completed connectors are as follows. You are more than welcome to try them out and raise issues in our community if you find bugs.
+
+![](https://miro.medium.com/max/700/1*RHNJDcbvKmSt2UGGSz3Icg.png)
+
+The connectors shown below have already been claimed and developed:
+
+![](https://miro.medium.com/max/700/1*RHNJDcbvKmSt2UGGSz3Icg.png)
+
+Also, we have connectors in the roadmap that we want to support in the near future. To foster the process, the SeaTunnel community initiated the SeaTunnel Connector Access Incentive Plan, and you are more than welcome to contribute to the project.
+
+SeaTunnel Connector Access Incentive Plan: 
https://github.com/apache/incubator-seatunnel/issues/1946
+
+You can claim tasks that haven't been marked in the comment area and take a prize home! Here is a part of the connectors that need to be supported as soon as possible:
+![](https://miro.medium.com/max/414/1*n-ixPtq066Acx4Ja5qNQqw.png)
+In fact, the implementations of Connectors like Feishu, DingTalk, and Facebook 
messenger are quite simple as the connectors do not need to carry a large 
amount of data (just a simple Source and Sink). This is in sharp contrast to 
Hive and other databases that need to consider transaction consistency or 
concurrency issues.
+
+We welcome everyone to make contributions and join our Apache SeaTunnel family!
+
+## About SeaTunnel
+SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports the real-time synchronization of massive amounts of data and can synchronize hundreds of billions of records per day stably and efficiently.
+
+### Why do we need SeaTunnel?
+
+SeaTunnel does everything it can to solve the problems you may encounter in 
synchronizing massive amounts of data.
+
+* Data loss and duplication
+* Task buildup and latency
+* Low throughput
+* Long application-to-production cycle time
+* Lack of application status monitoring
+
+### SeaTunnel Usage Scenarios
+* Massive data synchronization
+* Massive data integration
+* ETL of large volumes of data
+* Massive data aggregation
+* Multi-source data processing
+
+### Features of SeaTunnel
+
+* Rich components
+* High scalability
+* Easy to use
+* Mature and stable
