-1. I agree with Kelu, and I have the same questions about the engine:

1. What is the relationship with Flink/Spark? Is our engine designed to replace them?
2. If it is designed on top of Flink/Spark, how do we achieve our goals without modifying Flink/Spark code?
3. Why would users choose the SeaTunnel Engine instead of using Flink/Spark directly?
4. How would users use the native Flink/Spark connectors?
________________________________
From: Kelu Tao <[email protected]>
Sent: May 30, 2022, 20:59
To: [email protected]
Subject: Re: [DISCUSS] Do we need to have our own engine

Hi gaojun, thanks for sharing. I have some questions about the engine:

1. What is the relationship with Flink/Spark? Is our engine designed to replace them?
2. If it is designed to replace Flink/Spark, how do we build such a huge thing from scratch?
3. If it is designed on top of Flink/Spark, how do we achieve our goals without modifying Flink/Spark code?

Thanks,
Kelu

On Fri, May 27, 2022 at 6:07 PM JUN GAO <[email protected]> wrote:

Why do we need the SeaTunnel Engine, and what problems do we want to solve?

- *Better resource utilization rate*

Real-time data synchronization is an important user scenario. Sometimes we need real-time synchronization of a full database. Today, a common practice among data synchronization engines is one job per table. The advantage of this practice is that one job's failure does not affect the others, but it wastes resources when most of the tables hold only a small amount of data.

We hope the SeaTunnel Engine can solve this problem. We plan to support a more flexible resource sharing strategy: jobs submitted by the same user will be allowed to share resources, and users will even be able to specify which jobs share resources with each other. If anyone has an idea, welcome to discuss it on the mailing list or in a GitHub issue.

- *Fewer database connectors*

Another common problem in full-database synchronization with CDC is that each table needs its own database connection. This puts a lot of pressure on the database server when the database contains many tables.

Can we design the database connection pool as a shared resource between jobs? Users could configure their own connection pool, and when a job uses it, the SeaTunnel Engine would initialize the pool on the node where the source/sink connector runs and then push the pool into the source/sink connector. Combined with the better resource utilization rate described above (https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8), we can reduce the number of database connections to an acceptable range.

Another way to reduce the connections used by the CDC source connector is to support reading multiple tables in one CDC source connector and then splitting the stream by table name inside the SeaTunnel Engine. This reduces the connections used by the CDC source, but it cannot reduce the connections used by the sink when the synchronization target is also a database, so a shared connection pool is still a good way to solve that (see the sketch below).
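To make the shared pool idea concrete, here is a minimal Java sketch of a node-local pool registry reused by every job that targets the same database. All names here (SharedConnectionPool, forUrl, acquire, release) are hypothetical illustrations, not an existing SeaTunnel API:

```java
// Hypothetical sketch only: a node-local pool registry keyed by JDBC URL,
// so that jobs running on the same node reuse one small pool instead of
// opening one database connection per table. Not a real SeaTunnel API.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

public final class SharedConnectionPool {
    // One pool per JDBC URL, shared by every job on this node.
    private static final Map<String, SharedConnectionPool> POOLS = new ConcurrentHashMap<>();

    private final BlockingQueue<Connection> idle;

    private SharedConnectionPool(String url, String user, String password, int size)
            throws SQLException {
        this.idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(DriverManager.getConnection(url, user, password));
        }
    }

    public static SharedConnectionPool forUrl(String url, String user, String password, int size) {
        return POOLS.computeIfAbsent(url, u -> {
            try {
                return new SharedConnectionPool(u, user, password, size);
            } catch (SQLException e) {
                throw new IllegalStateException("cannot initialize pool for " + u, e);
            }
        });
    }

    // Blocks until a connection is free; callers must give it back via release().
    public Connection acquire() throws InterruptedException {
        return idle.take();
    }

    public void release(Connection connection) {
        idle.offer(connection);
    }
}
```

With something like this, a connector would call SharedConnectionPool.forUrl(...) instead of DriverManager.getConnection(...), so ten per-table jobs against one server could share, say, four connections instead of opening ten.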
- *Data Cache between Source and Sink*

Flume is an excellent data synchronization project. A Flume channel can cache data when the sink fails and cannot write. This is useful in some scenarios: for example, some users can keep their database logs for only a limited time, so the CDC source connector must be able to keep reading the logs even when the sink cannot write.

A feasible workaround today is to start two jobs. One job uses the CDC source connector to read the database logs and a Kafka sink connector to write the data to Kafka; the other job uses a Kafka source connector to read the data from Kafka and the target sink connector to write it to the target.

This solution requires the user to have a deep understanding of low-level technology, and two jobs increase the difficulty of operation and maintenance. Because every job needs its own JobMaster, it also needs more resources.

Ideally, users should only need to know that data is read from the source and written to the sink, and that along the way the data can be cached in case the sink fails. The synchronization engine should automatically add the cache operation to the execution plan and ensure the source keeps working even if the sink fails. In this process, the engine must ensure that writing to and reading from the cache are transactional, which guarantees data consistency.

The execution plan would look like this (diagram omitted; see the design doc linked below).

- *Schema Evolution*

Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that changes over time. Most commonly, it is used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.

This feature is required in real-time data warehouse scenarios. Currently, the Flink and Spark engines do not support it.

- *Finer fault tolerance*

At present, most real-time processing engines fail the whole job when one of its tasks fails, mainly because downstream operators depend on the calculation results of upstream operators. In the data synchronization scenario, however, the data is simply read from the source and written to the sink; there is no intermediate result state to keep, so the failure of one task does not affect the correctness of the other tasks' results.

The new engine should provide more fine-grained fault tolerance. It should support the failure of a single task without affecting the execution of other tasks, and it should provide an interface so that users can manually retry failed tasks instead of retrying the entire job (one possible shape of that interface is sketched below).
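As one possible shape for that interface, here is a hedged Java sketch. The names (TaskFaultController, TaskHandle, retryTask) are made up for illustration and are not an existing SeaTunnel API:

```java
// Hypothetical sketch only: per-task fault tolerance. Instead of a
// job-level restart, the engine keeps failed tasks in a FAILED state
// and exposes a handle to retry each one individually. Not a real
// SeaTunnel API.
import java.util.List;

public interface TaskFaultController {

    enum TaskState { RUNNING, FINISHED, FAILED }

    // One source->sink pipeline inside a job, e.g. the sync of one table.
    interface TaskHandle {
        String taskId();
        TaskState state();
        Throwable failureCause();
    }

    // Tasks of the job that have failed; all other tasks keep running.
    List<TaskHandle> failedTasks(String jobId);

    // Restart only this task from its last checkpoint, leaving the other
    // tasks untouched, so one broken table does not force a full-job retry.
    void retryTask(String jobId, String taskId);
}
```

The design point is that retryTask targets a single task's state, so a persistent problem with one table never forces the whole job to restart.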
- *Speed Control*

In batch jobs we need to support speed control, so that users can choose the synchronization speed they want and prevent too much impact on the source or target database.

*More Information*

I have made a simple design for the SeaTunnel Engine. You can find more details in the following document:

https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub

--

Best Regards

------------

Apache DolphinScheduler PMC

Jun Gao
[email protected]

--
Hello, find me here: www.legendtkl.com.