hi,

the target of this engine is not to replace Flink or Spark; users can
choose which engine to run.
Best Regards

---------------
Apache DolphinScheduler PMC Chair & Apache SeaTunnel PPMC
Lidong Dai
[email protected]
Linkedin: https://www.linkedin.com/in/dailidong
Twitter: @WorkflowEasy
---------------


On Sun, Jun 5, 2022 at 12:44 AM Lidong Dai <[email protected]> wrote:
>
> hi,
>
> > 1. the relationship with Flink/Spark. Is our engine designed to replace
> > Flink/Spark?
>
> The target of this engine is not to replace Flink or Spark; users can
> choose which engine to run.
>
>
> Best Regards
>
> ---------------
> Apache DolphinScheduler PMC Chair & Apache SeaTunnel PPMC
> Lidong Dai
> [email protected]
> Linkedin: https://www.linkedin.com/in/dailidong
> Twitter: @WorkflowEasy
> ---------------
>
>
> On Mon, May 30, 2022 at 9:01 PM 陶克路 <[email protected]> wrote:
>
> > Hi Gaojun, thanks for sharing.
> >
> > I have some questions about the engine:
> >
> > 1. What is the relationship with Flink/Spark? Is our engine designed to
> > replace Flink/Spark?
> > 2. If it is designed to replace Flink/Spark, how can we build such a huge
> > thing from scratch?
> > 3. If it is designed on top of Flink/Spark, how can we achieve our goals
> > without modifying Flink/Spark code?
> >
> > Thanks,
> > Kelu
> >
> > On Fri, May 27, 2022 at 6:07 PM JUN GAO <[email protected]> wrote:
> >
> > > Why do we need the SeaTunnel Engine, and what problems do we want to
> > > solve?
> > >
> > > - *Better resource utilization rate*
> > >
> > > Real-time data synchronization is an important user scenario, and
> > > sometimes users need real-time synchronization of a full database. A
> > > common practice among data synchronization engines today is one job
> > > per table. The advantage of this practice is that the failure of one
> > > job does not affect the others, but it wastes resources when most of
> > > the tables hold only a small amount of data.
> > >
> > > We hope the SeaTunnel Engine can solve this problem. We plan to
> > > support a more flexible resource sharing strategy: jobs submitted by
> > > the same user will be allowed to share resources, and users will even
> > > be able to specify which jobs share resources with each other. If
> > > anyone has an idea, welcome to discuss it on the mailing list or in a
> > > GitHub issue.
> > >
> > > - *Fewer database connections*
> > >
> > > Another common problem in full-database synchronization with CDC is
> > > that each table needs its own database connection. This puts a lot of
> > > pressure on the database server when the database contains many
> > > tables.
> > >
> > > Can we design the database connections as a resource shared between
> > > jobs? Users could configure a connection pool; when a job uses it,
> > > the SeaTunnel Engine will initialize the pool on the node where the
> > > source/sink connector runs and then hand it to the source/sink
> > > connector. Combined with the resource sharing described under Better
> > > resource utilization rate
> > > <https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
> > > we can reduce the number of database connections to an acceptable
> > > range.
> > >
> > > Another way to reduce the connections used by the CDC Source
> > > Connector is to support reading multiple tables in a single CDC
> > > Source Connector and then split the stream by table name inside the
> > > SeaTunnel Engine. This reduces the connections used by the CDC Source
> > > Connector, but it cannot reduce the connections used by the sink if
> > > the synchronization target is also a database. So a shared database
> > > connection pool is a good way to solve both; a sketch of such a pool
> > > follows below.
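> > > To make this concrete, here is a minimal Java sketch of such a
> > > node-local shared pool. The names (SharedConnectionPools, SimplePool)
> > > are hypothetical illustrations, not the actual SeaTunnel Engine API:
> > >
> > >   import java.sql.Connection;
> > >   import java.sql.DriverManager;
> > >   import java.sql.SQLException;
> > >   import java.util.ArrayDeque;
> > >   import java.util.Map;
> > >   import java.util.Queue;
> > >   import java.util.concurrent.ConcurrentHashMap;
> > >
> > >   // Node-local registry: jobs running on the same node look up a pool
> > >   // by (jdbcUrl, user) instead of opening their own connections.
> > >   final class SharedConnectionPools {
> > >       private static final Map<String, SimplePool> POOLS =
> > >               new ConcurrentHashMap<>();
> > >
> > >       static SimplePool getOrCreate(String jdbcUrl, String user,
> > >                                     String password, int maxSize) {
> > >           return POOLS.computeIfAbsent(jdbcUrl + "|" + user,
> > >                   key -> new SimplePool(jdbcUrl, user, password, maxSize));
> > >       }
> > >   }
> > >
> > >   // A deliberately simple blocking pool; a real engine would add
> > >   // health checks, borrow timeouts, and per-job accounting.
> > >   final class SimplePool {
> > >       private final String jdbcUrl, user, password;
> > >       private final int maxSize;
> > >       private final Queue<Connection> idle = new ArrayDeque<>();
> > >       private int created = 0;
> > >
> > >       SimplePool(String jdbcUrl, String user, String password, int maxSize) {
> > >           this.jdbcUrl = jdbcUrl;
> > >           this.user = user;
> > >           this.password = password;
> > >           this.maxSize = maxSize;
> > >       }
> > >
> > >       synchronized Connection borrow()
> > >               throws SQLException, InterruptedException {
> > >           while (true) {
> > >               Connection c = idle.poll();
> > >               if (c != null) return c;
> > >               if (created < maxSize) { // cap total connections to the server
> > >                   created++;
> > >                   return DriverManager.getConnection(jdbcUrl, user, password);
> > >               }
> > >               wait(); // block until another task returns a connection
> > >           }
> > >       }
> > >
> > >       synchronized void release(Connection c) {
> > >           idle.offer(c);
> > >           notifyAll();
> > >       }
> > >   }
> > >
> > > With this shape, the sink tasks for table A and table B on the same
> > > node draw from the same capped pool, so the total number of
> > > connections to the database server stays bounded no matter how many
> > > tables are synchronized.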
> > >
> > > - *Data Cache between Source and Sink*
> > >
> > > Flume is an excellent data synchronization project. The Flume Channel
> > > can cache data when the sink fails and cannot write. This is useful
> > > in some scenarios: for example, some users can keep their database
> > > logs for only a limited time, so the CDC Source Connector must be
> > > able to keep reading the logs even if the sink cannot write.
> > >
> > > A feasible solution today is to start two jobs. One job uses the CDC
> > > Source Connector to read database logs and the Kafka Sink Connector
> > > to write the data to Kafka; the other job uses the Kafka Source
> > > Connector to read from Kafka and the target Sink Connector to write
> > > to the target. This solution requires the user to have a deep
> > > understanding of the low-level technology, and two jobs increase the
> > > difficulty of operation and maintenance. And because every job needs
> > > its own JobMaster, it also needs more resources.
> > >
> > > Ideally, users should only need to know that data will be read from
> > > the source and written to the sink, and that in between the data can
> > > be cached in case the sink fails. The synchronization engine should
> > > add the cache operation to the execution plan automatically and
> > > ensure that the source keeps working even when the sink fails. In
> > > this process, the engine must make the writes to and reads from the
> > > cache transactional, which ensures the consistency of the data.
> > >
> > > The execution plan looks like this: (diagram omitted; see the design
> > > document linked below)
> > >
> > > - *Schema Evolution*
> > >
> > > Schema evolution is a feature that allows users to easily change a
> > > table's current schema to accommodate data that changes over time.
> > > Most commonly, it is used when performing an append or overwrite
> > > operation, to automatically adapt the schema to include one or more
> > > new columns.
> > >
> > > This feature is required in real-time data warehouse scenarios.
> > > Currently, the Flink and Spark engines do not support it.
> > >
> > > - *Finer fault tolerance*
> > >
> > > At present, most real-time processing engines fail the whole job when
> > > one of its tasks fails. The main reason is that a downstream operator
> > > depends on the calculation results of its upstream operators. In the
> > > data synchronization scenario, however, the data is simply read from
> > > the source and written to the sink; no intermediate result state
> > > needs to be saved. Therefore, the failure of one task does not affect
> > > whether the results of the other tasks are correct.
> > >
> > > The new engine should provide more sophisticated fault-tolerance
> > > management. It should support the failure of a single task without
> > > affecting the execution of the other tasks, and it should provide an
> > > interface so that users can manually retry failed tasks instead of
> > > retrying the entire job.
> > >
> > > - *Speed Control*
> > >
> > > In batch jobs, we need to support speed control, letting users choose
> > > the synchronization speed they want so that the job does not put too
> > > much pressure on the source or target database. A sketch of one
> > > possible mechanism follows below.
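> > > As a minimal illustration of such speed control, here is a small Java
> > > token-window rate limiter that caps the number of rows emitted per
> > > second. RowRateLimiter is a hypothetical name, not the actual
> > > SeaTunnel Engine API:
> > >
> > >   // Caps throughput at rowsPerSecond; a source task calls
> > >   // acquire(batch.size()) before pushing each batch downstream.
> > >   final class RowRateLimiter {
> > >       private final long rowsPerSecond;
> > >       private long windowStart = System.nanoTime();
> > >       private long permitsUsed = 0;
> > >
> > >       RowRateLimiter(long rowsPerSecond) {
> > >           this.rowsPerSecond = rowsPerSecond;
> > >       }
> > >
> > >       synchronized void acquire(long rows) throws InterruptedException {
> > >           permitsUsed += rows;
> > >           long elapsed = System.nanoTime() - windowStart;
> > >           if (elapsed >= 1_000_000_000L) {
> > >               // a full second has passed: start a new window
> > >               windowStart = System.nanoTime();
> > >               permitsUsed = rows;
> > >           } else if (permitsUsed > rowsPerSecond) {
> > >               // budget for this second is spent: sleep out the window
> > >               Thread.sleep((1_000_000_000L - elapsed) / 1_000_000L);
> > >               windowStart = System.nanoTime();
> > >               permitsUsed = rows;
> > >           }
> > >       }
> > >   }
> > >
> > > The engine could create one limiter per job from a user-facing option
> > > (for example, a configured rows-per-second limit) and share it among
> > > that job's source tasks.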
> > >
> > > *More Information*
> > >
> > > I have made a simple design for the SeaTunnel Engine. You can learn
> > > more details in the following document:
> > >
> > > https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
> > >
> > > --
> > > Best Regards
> > >
> > > ------------
> > > Apache DolphinScheduler PMC
> > > Jun Gao
> > > [email protected]
> >
> > --
> > Hello, Find me here: www.legendtkl.com.
