Re: [DISCUSS] Do we need to have our own engine

张涛 Sun, 05 Jun 2022 19:48:48 -0700

Developers and users will face three types of engine connectors (Spark,
Flink, SeaTunnel), which will be very painful for users.


Best,
Liming
________________________________
From: 张 涛 <[email protected]>
Sent: June 6, 2022 10:24
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Do we need to have our own engine


-1,

> At the same time reduce the user's component deployment and maintenance costs.

More engines will be unfriendly for users.

> I think both our own engine and Flink/Spark will exist in the short term.

How did you reach this conclusion?

> I think it is possible to achieve it step by step.

OK, this is not a good message to users.

BTW, Apache Flink/Spark can also run in standalone mode, not only in cluster
mode. This is a bad decision, and I hope the SeaTunnel community will listen
to other people's opinions.

Best,
Liming
________________________________
From: 范佳 <[email protected]>
Sent: June 6, 2022 10:06
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Do we need to have our own engine

+1,

If we can implement the following features, it can help SeaTunnel provide
better usability and performance. At the same time reduce the user's
component deployment and maintenance costs.

I think both our own engine and Flink/Spark will exist in the short term.
For example, our engine can provide a simpler operation mode in a
single-machine environment, and Flink/Spark provides a clustered operation
mode. In the end, replacing Flink/Spark entirely would be the best result.

To achieve such a large engine, I think it is possible to achieve it step by
step.

> On May 27, 2022 at 18:06, JUN GAO <[email protected]> wrote:
>
> Why do we need the SeaTunnel Engine, and what problems do we want to solve?
>
>
>   - *Better resource utilization rate*
>
> Real-time data synchronization is an important user scenario. Sometimes we
> need real-time synchronization of a full database. Currently, a common
> practice among data synchronization engines is one job per table. The
> advantage of this practice is that one job's failure does not affect the
> others. But it wastes resources when most of the tables hold only a small
> amount of data.
>
> We hope the SeaTunnel Engine can solve this problem. We plan to support a
> more flexible resource-sharing strategy. It will allow jobs to share
> resources when they are submitted by the same user. Users can even specify
> which jobs share resources with each other. If anyone has ideas, you are
> welcome to discuss them on the mailing list or in a GitHub issue.
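The sharing rule described above could be sketched roughly as follows. This is only an illustration and not the SeaTunnel Engine design: `Job`, `share_group`, and `group_jobs_for_sharing` are hypothetical names, and the rule "an explicit group wins, otherwise share per user" is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    name: str
    user: str
    share_group: Optional[str] = None  # hypothetical: group chosen explicitly by the user


def group_jobs_for_sharing(jobs):
    """Map each resource-group key to the names of the jobs that would share resources."""
    groups = defaultdict(list)
    for job in jobs:
        # Assumed rule: an explicit group wins, otherwise jobs of the same user share.
        key = job.share_group or f"user:{job.user}"
        groups[key].append(job.name)
    return dict(groups)
```

Jobs that land under the same key would be candidates for running in one set of resources instead of one set per table.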
>
>
>   - *Fewer database connectors*
>
> Another common problem in full-database synchronization using CDC is that
> each table needs its own database connector. This puts a lot of pressure on
> the database server when the database contains many tables.
>
> Can we design the database connectors as a shared resource between jobs?
> Users can configure their database connector pool. When a job uses the
> connector pool, the SeaTunnel Engine will initialize the pool on the node
> where the source/sink connector runs, and then pass the pool to the
> source/sink connector. Combined with the "Better resource utilization rate"
> feature
> <https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
> we can reduce the number of database connections to an acceptable range.
>
> Another way to reduce the database connectors used by the CDC Source
> Connector is to support reading multiple tables in one CDC Source
> Connector. The stream is then split by table name in the SeaTunnel Engine.
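Splitting one shared stream by table name might look like the sketch below. The event layout (a dict with a `table` key) and the function name `split_by_table` are assumptions made for illustration, not a SeaTunnel API.

```python
from collections import defaultdict


def split_by_table(change_events):
    """Route events read over one shared CDC stream into per-table streams."""
    per_table = defaultdict(list)
    for event in change_events:
        # Assumed event shape: every change event carries the name of its table.
        per_table[event["table"]].append(event)
    return dict(per_table)
```

The order of events within each table is preserved, which is what a downstream sink for that table would depend on.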
>
> This reduces the database connectors used by the CDC Source Connector, but
> it cannot reduce the database connectors used by the sink if the
> synchronization target is also a database. So a shared database connector
> pool would be a good way to solve that.
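A minimal sketch of the shared-pool idea, assuming the pool is a fixed set of connections that jobs borrow and return. `SharedConnectionPool`, `connect`, and `borrow` are hypothetical names. The proposal does not specify an interface.

```python
import queue
from contextlib import contextmanager


class SharedConnectionPool:
    """A fixed-size pool that several jobs borrow connections from."""

    def __init__(self, connect, size):
        # `connect` is any zero-argument callable that opens one connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks while every connection is in use; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back
```

With a pool of size N, the database sees at most N connections no matter how many tables or jobs use it, at the cost of jobs waiting when the pool is exhausted.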
>
>
>   - *Data Cache between Source and Sink*
>
> Flume is an excellent data synchronization project. A Flume Channel can
> cache data when the sink fails and cannot write data. This is useful in
> some scenarios. For example, some users retain their database logs for only
> a limited time. The CDC Source Connector must ensure it can read the
> database logs even if the sink cannot write data.
>
> A feasible solution is to start two jobs. One job uses the CDC Source
> Connector to read the database logs and the Kafka Sink Connector to write
> the data to Kafka. The other job uses the Kafka Source Connector to read the
> data from Kafka and the target Sink Connector to write it to the target.
> This solution requires the user to have a deep understanding of the
> underlying technology, and running two jobs makes operation and maintenance
> harder. Because every job needs a JobMaster, it also needs more resources.
>
> Ideally, users only need to know that they read data from the source and
> write it to the sink, while the data is cached along the way in case the
> sink fails. The synchronization engine needs to automatically add the cache
> operation to the execution plan and ensure the source keeps working even if
> the sink fails. In this process, the engine needs to ensure that writes to
> the cache and reads from the cache are transactional, which guarantees data
> consistency.
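The decoupling described above can be illustrated with a small in-memory buffer. This is a sketch under simplifying assumptions: `BufferedChannel` is a hypothetical name, the buffer is not durable, and it only gives at-least-once delivery to the sink, not the transactional guarantee the proposal asks for.

```python
from collections import deque


class BufferedChannel:
    """In-memory stand-in for a durable cache between source and sink."""

    def __init__(self):
        self._pending = deque()

    def put(self, record):
        # The source only appends, so it never waits for the sink.
        self._pending.append(record)

    def drain(self, write):
        """Deliver records in order; a record leaves the cache only after write() succeeds."""
        delivered = 0
        while self._pending:
            try:
                write(self._pending[0])
            except Exception:
                break  # sink is down: keep the record and stop for now
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

The source keeps calling `put` while the sink is unavailable, and a later `drain` catches up in the original order.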
>
> The execution plan looks like this:
>
>
>   - *Schema Evolution*
>
> Schema evolution is a feature that allows users to easily change a table’s
> current schema to accommodate data that is changing over time. Most
> commonly, it’s used when performing an append or overwrite operation, to
> automatically adapt the schema to include one or more new columns.
>
> This feature is required in real-time data warehouse scenarios. Currently,
> the Flink and Spark engines do not support this feature.
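For the append case mentioned above, adding new columns to a schema could be sketched like this. The schema representation (a dict from column name to type name) and the function `evolve_schema` are assumptions for illustration only.

```python
def evolve_schema(schema, record):
    """Return a copy of `schema` extended with columns that appear only in `record`."""
    evolved = dict(schema)
    for column, value in record.items():
        if column not in evolved:
            # Naive type inference from the first value seen, for illustration only.
            evolved[column] = type(value).__name__
    return evolved
```

A real engine would also have to propagate the change to the sink, for example by altering the target table, which this sketch leaves out.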
>
>
>   - *Finer fault tolerance*
>
> At present, most real-time processing engines make the whole job fail when
> one of its tasks fails. The main reason is that the downstream operator
> depends on the calculation results of the upstream operator. However, in
> the scenario of data synchronization, the data is simply read from the
> source and then written to sink. It does not need to save the intermediate
> result state. Therefore, the failure of one task will not affect whether
> the results of other tasks are correct.
>
> The new engine should provide more sophisticated fault-tolerant management.
> It should support the failure of a single task without affecting the
> execution of other tasks. It should provide an interface so that users can
> manually retry failed tasks instead of retrying the entire job.
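A minimal sketch of task-level isolation with a manual retry, assuming each task is an independent callable (for example, one table's source-to-sink copy). `run_tasks` and `retry_failed` are hypothetical names, not an existing interface.

```python
def run_tasks(tasks):
    """Run every task; one failure does not stop the others. Returns names of failed tasks."""
    failed = []
    for name, task in tasks.items():
        try:
            task()
        except Exception:
            failed.append(name)  # record the failure and continue with the next task
    return failed


def retry_failed(tasks, failed):
    """Re-run only the failed tasks, leaving the finished ones untouched."""
    return run_tasks({name: tasks[name] for name in failed})
```

This only works because, as noted above, synchronization tasks do not depend on each other's intermediate results.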
>
>
>   - *Speed Control*
>
> In batch jobs, we need to support speed control, letting users choose the
> synchronization speed they want so that the job does not put too much load
> on the source or target database.
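One common way to implement this kind of speed control is a token bucket. The sketch below is a generic example, not something taken from the proposal. `RateLimiter` and its parameters are hypothetical, and the burst is capped at one second's worth of records as an arbitrary choice.

```python
import time


class RateLimiter:
    """Token bucket that caps throughput at `rate` records per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self._clock = clock
        self._sleep = sleep
        self._tokens = 0.0
        self._last = clock()

    def acquire(self):
        """Block until one record may be sent."""
        now = self._clock()
        # Refill in proportion to elapsed time; cap the burst at one second's worth.
        self._tokens = min(self.rate, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens < 1:
            # Wait just long enough for the missing fraction of a token to accrue.
            self._sleep((1 - self._tokens) / self.rate)
            self._last = self._clock()
            self._tokens = 1.0
        self._tokens -= 1
```

A reader or writer would call `acquire()` once per record (or per batch) before touching the database. The `clock` and `sleep` parameters exist only so the limiter can be tested without real waiting.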
>
>
>
> *More Information*
>
>
> I have written a simple design for the SeaTunnel Engine. You can find more
> details in the following document.
>
> https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
>
>
> --
>
> Best Regards
>
> ------------
>
> Apache DolphinScheduler PMC
>
> Jun Gao
> [email protected]
>
