Hi @zhangtaoliang, developers and users do not need to face three engines. As far as I know, the community is promoting the api-draft branch: connector developers only need to develop connectors against that API, and the same connector can then run on Spark, Flink, and the SeaTunnel Engine at the same time.
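To make that concrete, here is a minimal sketch of the translation-layer idea in the spirit of the api-draft branch; every name below (UnifiedSource, JdbcSource, the adapter classes) is hypothetical and only illustrates the pattern, not the actual API:

import java.io.Serializable;
import java.util.Iterator;
import java.util.List;

// Hypothetical engine-neutral source contract; a connector implements it once.
interface UnifiedSource<T> extends Serializable {
    Iterator<T> open();   // start reading records
    void close();
}

// One connector implementation, written a single time.
class JdbcSource implements UnifiedSource<String> {
    @Override
    public Iterator<String> open() {
        // A real connector would issue a JDBC query; a stub keeps the sketch small.
        return List.of("row-1", "row-2").iterator();
    }

    @Override
    public void close() { }
}

// Per-engine adapters translate the neutral contract to each runtime, so the
// same JdbcSource can run on Spark, Flink, or the SeaTunnel Engine.
class FlinkSourceAdapter<T> {
    FlinkSourceAdapter(UnifiedSource<T> source) { /* wrap in a Flink source */ }
}

class SparkSourceAdapter<T> {
    SparkSourceAdapter(UnifiedSource<T> source) { /* wrap in a Spark reader */ }
}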
张 涛 <[email protected]> wrote on Monday, June 6, 2022 at 10:48:

> Developers and users will face three types of engine connectors (Spark,
> Flink, SeaTunnel), which will be very painful for users.
>
> Best,
> Liming
> ________________________________
> From: 张 涛 <[email protected]>
> Sent: June 6, 2022 10:24
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Do we need to have our own engine
>
> -1.
>
> > At the same time reduce the user's component deployment and maintenance
> > costs.
>
> More engines will be unfriendly to users.
>
> > I think both our own engine and Flink/Spark will exist in the short term.
>
> Why draw this conclusion?
>
> > I think it is possible to achieve it step by step.
>
> OK, but this is not a good message to send to users.
>
> BTW, Apache Flink/Spark can also run in standalone mode, not only cluster
> mode. This is a bad decision, and I hope the SeaTunnel community will
> listen to other people's opinions.
>
> Best,
> Liming
> ________________________________
> From: 范佳 <[email protected]>
> Sent: June 6, 2022 10:06
> To: [email protected] <[email protected]>
> Subject: Re: [DISCUSS] Do we need to have our own engine
>
> +1.
> If we can implement the features below, they will give SeaTunnel better
> usability and performance, and at the same time reduce users' component
> deployment and maintenance costs.
> I think both our own engine and Flink/Spark will exist in the short term.
> For example, our engine can provide a simpler operation mode in a
> single-machine environment, while Flink/Spark provide a clustered
> operation mode. In the end, full replacement would be the best outcome.
> An engine of this size can, I think, be built step by step.
>
> > On May 27, 2022, at 18:06, JUN GAO <[email protected]> wrote:
> >
> > Why do we need the SeaTunnel Engine, and what problems do we want to
> > solve?
> >
> > - *Better resource utilization rate*
> >
> > Real-time data synchronization is an important user scenario. Sometimes
> > we need real-time synchronization of a full database. A common practice
> > in data synchronization engines today is one job per table. Its
> > advantage is that one job's failure does not influence the others, but
> > it wastes resources when most of the tables hold only a small amount of
> > data.
> >
> > We hope the SeaTunnel Engine can solve this problem. We plan to support
> > a more flexible resource-sharing strategy: jobs submitted by the same
> > user will be allowed to share resources, and users will even be able to
> > specify which jobs share resources with each other. If anyone has an
> > idea, welcome to discuss it on the mailing list or in a GitHub issue.
> >
> > - *Fewer database connections*
> >
> > Another common problem in full-database synchronization with CDC is
> > that each table needs its own database connection, which puts a lot of
> > pressure on the database server when the database contains many tables.
> >
> > Can we design the database connections as a resource shared between
> > jobs? Users could configure a database connection pool; when a job uses
> > it, the SeaTunnel Engine would initialize the pool on the node where
> > the source/sink connector runs and then push the pool into the
> > source/sink connector.
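A minimal sketch of this shared-pool idea, assuming HikariCP as the underlying JDBC pool; ConnectorPoolRegistry and its method names are illustrative guesses, not the proposed engine API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Hypothetical per-node registry: every job on the node that targets the same
// database shares one pool instead of opening one connection per table.
final class ConnectorPoolRegistry {

    private static final Map<String, HikariDataSource> POOLS = new ConcurrentHashMap<>();

    // Returns the shared pool for a (url, user) pair, creating it on first use.
    static HikariDataSource getOrCreate(String jdbcUrl, String user,
                                        String password, int maxPoolSize) {
        return POOLS.computeIfAbsent(jdbcUrl + '|' + user, key -> {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl(jdbcUrl);
            config.setUsername(user);
            config.setPassword(password);
            config.setMaximumPoolSize(maxPoolSize);  // hard cap per database server
            return new HikariDataSource(config);
        });
    }
}

With this shape, a full-database sync of, say, 100 tables submitted by one user would hold at most maxPoolSize connections to the server rather than 100.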
> > With the Better resource utilization rate feature
> > <https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
> > we can reduce the number of database connections to an acceptable range.
> >
> > Another way to reduce the connections used by the CDC source connector
> > is to support reading multiple tables in one CDC source connector; the
> > stream is then split by table name inside the SeaTunnel Engine.
> >
> > This reduces the connections used by the CDC source connector, but it
> > cannot reduce the connections used by the sink if the synchronization
> > target is also a database. So a shared database connection pool is a
> > good way to solve both.
> >
> > - *Data Cache between Source and Sink*
> >
> > Flume is an excellent data synchronization project. A Flume Channel can
> > cache data when the sink fails and cannot write. This is useful in some
> > scenarios. For example, some users can keep their database logs for
> > only a limited time, so the CDC source connector must be able to keep
> > reading the logs even when the sink cannot write.
> >
> > A feasible workaround is to start two jobs: one uses the CDC source
> > connector to read the database logs and a Kafka sink connector to write
> > them to Kafka, and the other uses a Kafka source connector to read from
> > Kafka and the target sink connector to write to the target. This
> > solution requires the user to understand the low-level technology, and
> > two jobs increase the operation and maintenance burden. Because every
> > job needs its own JobMaster, it also consumes more resources.
> >
> > Ideally, users should only need to know that data is read from the
> > source and written to the sink, while the engine caches the data in
> > between in case the sink fails. The synchronization engine needs to add
> > the cache operation to the execution plan automatically and ensure the
> > source keeps working even if the sink fails. In this process, the
> > engine must make writing to and reading from the cache transactional,
> > which guarantees data consistency.
> >
> > The execution plan would look like this: Source -> Cache -> Sink.
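To show what "transactional cache" could mean in practice, here is a sketch of the write side of such a cache stage, assuming Kafka as the cache storage (the proposal leaves the storage choice open, and CacheStageWriter is a hypothetical name). A matching reader would set isolation.level=read_committed so it only sees fully committed batches:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical cache stage the engine would insert between source and sink.
// Each batch is written atomically, so the cache never exposes half a batch.
final class CacheStageWriter implements AutoCloseable {

    private final KafkaProducer<String, String> producer;
    private final String cacheTopic;

    CacheStageWriter(String bootstrapServers, String cacheTopic, String txnId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("transactional.id", txnId);  // enables transactional writes
        this.cacheTopic = cacheTopic;
        this.producer = new KafkaProducer<>(props);
        producer.initTransactions();
    }

    // Writes one batch atomically; a read_committed reader sees all or nothing.
    void writeBatch(Iterable<String> records) {
        producer.beginTransaction();
        try {
            for (String record : records) {
                producer.send(new ProducerRecord<>(cacheTopic, record));
            }
            producer.commitTransaction();
        } catch (RuntimeException e) {
            producer.abortTransaction();  // roll back so the cache stays consistent
            throw e;
        }
    }

    @Override
    public void close() {
        producer.close();
    }
}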
> > - *Schema Evolution*
> >
> > Schema evolution is a feature that allows users to easily change a
> > table's current schema to accommodate data that changes over time. Most
> > commonly, it is used when performing an append or overwrite operation,
> > to automatically adapt the schema to include one or more new columns.
> >
> > This feature is required in real-time data warehouse scenarios.
> > Currently, the Flink and Spark engines do not support it.
> >
> > - *Finer fault tolerance*
> >
> > At present, most real-time processing engines fail the whole job when
> > one of its tasks fails, mainly because downstream operators depend on
> > the calculation results of upstream operators. In the data
> > synchronization scenario, however, the data is simply read from the
> > source and written to the sink, with no intermediate result state to
> > save. Therefore the failure of one task does not affect whether the
> > results of the other tasks are correct.
> >
> > The new engine should provide more sophisticated fault-tolerance
> > management: it should tolerate the failure of a single task without
> > affecting the execution of the others, and it should provide an
> > interface so that users can manually retry failed tasks instead of
> > retrying the entire job.
> >
> > - *Speed Control*
> >
> > In batch jobs we need to support speed control, letting users choose
> > the synchronization speed they want so that the source or target
> > database is not impacted too heavily.
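A sketch of what per-job speed control could look like in the read loop, using Guava's RateLimiter as the token bucket (Guava is an assumption for illustration; the proposal does not prescribe an implementation):

import java.util.Iterator;
import java.util.function.Consumer;
import com.google.common.util.concurrent.RateLimiter;

// Sketch: throttle a batch read loop to a user-configured rows-per-second budget.
final class ThrottledReader {

    private final RateLimiter limiter;  // token bucket, refilled at the configured rate

    ThrottledReader(double maxRowsPerSecond) {
        this.limiter = RateLimiter.create(maxRowsPerSecond);
    }

    // Pulls rows from the source, blocking as needed to respect the speed limit.
    void copy(Iterator<String> source, Consumer<String> sink) {
        while (source.hasNext()) {
            limiter.acquire();           // blocks until one permit is available
            sink.accept(source.next());  // forward one row within the budget
        }
    }
}

A user-facing limit expressed in rows or bytes per second would map directly onto maxRowsPerSecond here.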
> > *More Information*
> >
> > I have made a simple design for the SeaTunnel Engine. You can find more
> > details in the following document:
> >
> > https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
> >
> > --
> > Best Regards
> > ------------
> > Apache DolphinScheduler PMC
> > Jun Gao
> > [email protected]

--
Best Regards

------------
Apache DolphinScheduler PMC
Jun Gao
[email protected]