+1,
I don't think users have to face three engines; it just gives them a choice,
and they only need to pick the right one. The `seatunnel engine` does not
depend on spark/flink, and as gaojun said, it solves a lot of problems. I think
introducing the `seatunnel engine` is a good idea.


On June 6, 2022 at 10:48, 张 涛 <[email protected]> wrote:


Developers and users will face three types of engine connectors (spark, flink,
seatunnel), and this will be very painful for users.

Best,
Liming
________________________________
From: 张 涛 <[email protected]>
Sent: June 6, 2022 10:24
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Do we need to have our own engine

-1,

> At the same time reduce the user's component deployment and maintenance costs.

More engines will be unfriendly for users.

> I think both our own engine and Flink/Spark will exist in the short term.

Why draw this conclusion?

> I think it is possible to achieve it step by step.

OK, but this is not a good message to send to users. BTW, Apache Flink/Spark
can also run in standalone mode, not only cluster mode. This is a bad decision,
and I hope the SeaTunnel community will listen to other people's opinions.

Best,
Liming

________________________________
From: 范佳 <[email protected]>
Sent: June 6, 2022 10:06
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Do we need to have our own engine

+1. If we can implement the following features, they can help SeaTunnel provide
better usability and performance, and at the same time reduce the user's
component deployment and maintenance costs. I think both our own engine and
Flink/Spark will exist in the short term. For example, our engine can provide a
simpler operation mode in a single-machine environment, while Flink/Spark
provide a clustered operation mode. In the end, replacement is the best result.
To build such a large engine, I think it is possible to achieve it step by
step.

> On May 27, 2022 at 18:06, JUN GAO <[email protected]> wrote:
>
> Why do we need the SeaTunnel Engine, and what problems do we want to solve?
>
> - *Better resource utilization rate*
>
> Real-time data synchronization is an important user scenario. Sometimes we
> need real-time synchronization of a full database. Today, a common practice
> among data synchronization engines is one job per table. The advantage of
> this practice is that the failure of one job does not affect another. But it
> wastes resources when most of the tables hold only a small amount of data.
>
> We hope the SeaTunnel Engine can solve this problem. We plan to support a
> more flexible resource sharing strategy. It will allow jobs to share
> resources when they are submitted by the same user. Users can even specify
> which jobs share resources with each other. If anyone has an idea, welcome to
> discuss it on the mailing list or in a GitHub issue.
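
To make the resource-sharing idea concrete, here is a minimal, self-contained
sketch of jobs in one share group drawing slots from a single pool instead of
each job reserving its own. All names here are my own illustration, not real
SeaTunnel Engine APIs:

import java.util.HashMap;
import java.util.Map;

// Toy model: one slot pool per share group; jobs in the group borrow from it.
public final class SharedSlotPoolSketch {

    private final Map<String, Integer> freeSlotsByGroup = new HashMap<>();

    void registerGroup(String group, int slots) {
        freeSlotsByGroup.put(group, slots);
    }

    boolean submitJob(String jobName, String group, int slotsNeeded) {
        int free = freeSlotsByGroup.getOrDefault(group, 0);
        if (free < slotsNeeded) {
            System.out.println(jobName + ": pool '" + group + "' exhausted, queued");
            return false;
        }
        freeSlotsByGroup.put(group, free - slotsNeeded);
        System.out.println(jobName + ": running in shared pool '" + group + "'");
        return true;
    }

    public static void main(String[] args) {
        SharedSlotPoolSketch engine = new SharedSlotPoolSketch();
        // One pool of 4 slots shared by all of a user's small-table jobs,
        // instead of 4 slots reserved per job.
        engine.registerGroup("user1-small-tables", 4);
        engine.submitJob("sync_table_a", "user1-small-tables", 2);
        engine.submitJob("sync_table_b", "user1-small-tables", 2);
        engine.submitJob("sync_table_c", "user1-small-tables", 2); // must wait
    }
}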

> - *Fewer database connectors*
>
> Another common problem in full database synchronization with CDC is that each
> table needs its own database connector. This puts a lot of pressure on the db
> server when there are many tables in the database.
>
> Can we design the database connectors as a shared resource between jobs?
> Users can configure their database connector pool. When a job uses the
> connector pool, SeaTunnel Engine will initialize the connector pool on the
> node where the source/sink connector runs, and then push the connector pool
> into the source/sink connector. Together with the Better resource utilization
> rate feature
> <https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
> we can reduce the number of database connections to an acceptable range.
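
As a rough illustration of the node-wide pool (my own sketch, not the proposed
implementation), pools could be keyed by JDBC URL so that every source/sink
task on a node that targets the same database borrows from one bounded set of
connections:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative node-wide connector pool shared between jobs.
public final class SharedConnectorPool {

    private static final Map<String, BlockingQueue<Connection>> POOLS =
            new ConcurrentHashMap<>();

    // Called once per node for a given database; caps connections at `size`.
    public static void init(String jdbcUrl, String user, String pass, int size)
            throws SQLException {
        BlockingQueue<Connection> pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(DriverManager.getConnection(jdbcUrl, user, pass));
        }
        POOLS.putIfAbsent(jdbcUrl, pool);
    }

    // A task borrows a connection, blocking until one is free, so the db
    // server never sees more than `size` connections from this node.
    public static Connection borrow(String jdbcUrl) throws InterruptedException {
        return POOLS.get(jdbcUrl).take();
    }

    public static void release(String jdbcUrl, Connection conn) {
        POOLS.get(jdbcUrl).offer(conn);
    }
}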

> Another way to reduce the database connectors used by the CDC Source
> Connector is to support reading multiple tables in the CDC Source Connector,
> and then split the stream by table name inside the SeaTunnel Engine.
>
> This reduces the database connectors used by the CDC Source Connector, but it
> cannot reduce the database connectors used by the sink if the synchronization
> target is also a database. So a shared database connector pool is a good way
> to solve that.
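
The split-by-table-name step could look something like this (the record type
and routing API are purely illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// One CDC source feeds many tables; the engine fans records out by table name.
public final class TableSplitSketch {

    record CdcRecord(String table, String payload) {}

    public static void main(String[] args) {
        // One downstream branch per table.
        Map<String, Consumer<CdcRecord>> routes = new HashMap<>();
        routes.put("orders", r -> System.out.println("orders sink <- " + r.payload()));
        routes.put("users", r -> System.out.println("users sink <- " + r.payload()));

        // A single CDC source connection emits records for many tables...
        List<CdcRecord> stream = List.of(
                new CdcRecord("orders", "{id:1}"),
                new CdcRecord("users", "{id:7}"),
                new CdcRecord("orders", "{id:2}"));

        // ...and each record is routed to its table's branch.
        for (CdcRecord r : stream) {
            routes.getOrDefault(r.table(), x -> {}).accept(r);
        }
    }
}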

> - *Data Cache between Source and Sink*
>
> Flume is an excellent data synchronization project. The Flume Channel can
> cache data when the sink fails and cannot write it. This is useful in some
> scenarios. For example, some users have limited time to save their database
> logs, so the CDC Source Connector must be able to keep reading database logs
> even if the sink cannot write data.
>
> A feasible solution is to start two jobs. One job uses the CDC Source
> Connector to read database logs and a Kafka Sink Connector to write the data
> to Kafka. The other job uses a Kafka Source Connector to read the data from
> Kafka and the target Sink Connector to write it to the target. This solution
> requires the user to have a deep understanding of low-level technology, and
> two jobs increase the difficulty of operation and maintenance. Because every
> job needs a JobMaster, it also needs more resources.
>
> Ideally, users only know that they read data from the source and write it to
> the sink, while in between the data can be cached in case the sink fails. The
> synchronization engine needs to add the cache operation to the execution plan
> automatically and ensure that the source keeps working even if the sink
> fails. In this process, the engine needs to ensure that the data written to
> and read from the cache is transactional, which guarantees the consistency of
> the data.
>
> The execution plan looks like this: [diagram not reproduced in this
> plain-text mail]
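
Even without the diagram, the decoupling can be sketched in a few lines. This
toy version uses an in-memory queue; the real cache would be durable and
transactional, as described above:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Source always writes into the cache; the sink drains it at its own pace,
// so a sink failure never blocks the source.
public final class CacheStageSketch {

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> cache = new LinkedBlockingQueue<>();

        // Source side: keeps reading database logs regardless of the sink.
        Thread source = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                cache.add("log-" + i); // commit to the cache
            }
        });

        // Sink side: if it crashes, unread records simply stay in the cache
        // and are picked up again after recovery.
        Thread sink = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    System.out.println("sink wrote " + cache.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        sink.start();
        source.join();
        sink.join();
    }
}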

> - *Schema Evolution*
>
> Schema evolution is a feature that allows users to easily change a table's
> current schema to accommodate data that changes over time. Most commonly, it
> is used when performing an append or overwrite operation, to automatically
> adapt the schema to include one or more new columns.
>
> This feature is required in real-time data warehouse scenarios. Currently,
> the Flink and Spark engines do not support it.

> - *Finer fault tolerance*
>
> At present, most real-time processing engines fail the whole job when one of
> its tasks fails. The main reason is that the downstream operator depends on
> the calculation results of the upstream operator. However, in the data
> synchronization scenario, the data is simply read from the source and written
> to the sink; there is no intermediate result state to save. Therefore, the
> failure of one task does not affect whether the results of other tasks are
> correct.
>
> The new engine should provide more sophisticated fault-tolerance management.
> It should support the failure of a single task without affecting the
> execution of other tasks, and it should provide an interface so that users
> can manually retry failed tasks instead of retrying the entire job.
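
A per-task retry interface might look roughly like this (all names are my own
illustration, not a committed API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Task state is tracked per task, so one failed task can be retried without
// restarting the whole job.
public final class TaskRetrySketch {

    enum TaskState { RUNNING, FAILED }

    private final Map<String, TaskState> tasks = new ConcurrentHashMap<>();

    void reportFailure(String taskId) {
        // Only this task is marked failed; sibling tasks keep running since
        // sync tasks share no intermediate state.
        tasks.put(taskId, TaskState.FAILED);
    }

    void retryTask(String taskId) {
        if (tasks.get(taskId) == TaskState.FAILED) {
            tasks.put(taskId, TaskState.RUNNING); // reschedule just this task
            System.out.println("retrying " + taskId + "; other tasks untouched");
        }
    }

    public static void main(String[] args) {
        TaskRetrySketch job = new TaskRetrySketch();
        job.tasks.put("table_a-task", TaskState.RUNNING);
        job.tasks.put("table_b-task", TaskState.RUNNING);
        job.reportFailure("table_b-task");
        job.retryTask("table_b-task"); // table_a-task never stopped
    }
}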

> - *Speed Control*
>
> In batch jobs, we need to support speed control, so users can choose the
> synchronization speed they want and prevent too much impact on the source or
> target database.
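
Speed control is essentially a rate limiter in front of the emit path; here is
a minimal token-bucket-style sketch (illustrative only):

// Caps throughput at a user-chosen rows/second.
public final class SpeedControlSketch {

    private final double rowsPerSecond;
    private long nextFreeTimeNanos = System.nanoTime();

    SpeedControlSketch(double rowsPerSecond) {
        this.rowsPerSecond = rowsPerSecond;
    }

    // Block until the next row may be emitted.
    synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = nextFreeTimeNanos - now;
        nextFreeTimeNanos = Math.max(now, nextFreeTimeNanos)
                + (long) (1_000_000_000L / rowsPerSecond);
        if (waitNanos > 0) {
            Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SpeedControlSketch limiter = new SpeedControlSketch(2.0); // 2 rows/sec
        for (int i = 0; i < 4; i++) {
            limiter.acquire();
            System.out.println("emit row " + i);
        }
    }
}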

> *More Information*
>
> I have made a simple design for the SeaTunnel Engine. You can learn more
> details in the following document:
>
> https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
>
> --
>
> Best Regards
>
> ------------
>
> Apache DolphinScheduler PMC
>
> Jun Gao
> [email protected]
