aloyszhang opened a new issue, #9779: URL: https://github.com/apache/inlong/issues/9779
### Describe the proposal ## Motivation Currently, InLong provides real-time data synchronization based on the Flink engine, which has the advantage of low latency. Compared to real-time synchronization, offline data synchronization(not supported yet) pays more attention to synchronization throughput and efficiency. To enhance the usage scenarios of InLong, we plan to add support for offline data synchronization capability in InLong. The implementation is based on the Flink computing engine uniformly. Real-time synchronization tasks run in the manner of Flink stream tasks, while offline synchronization runs in the manner of Flink batch tasks. This approach can ensure the consistency of real-time and offline synchronization tasks' code as much as possible, reducing maintenance costs. ## Solution The offline synchronization feature of the InLong dataset integration provides sources and sinks for processing data, corresponding to data sources and destinations, and combines with the scheduling system to synchronize full or incremental data from the data source to the data target. InLong supports scheduling offline synchronization tasks by setting specific trigger times(including year, month, day, hour, and minute) through the scheduling system. Offline synchronization tasks are created by the Manager (including scheduling information), and the specific data synchronization logic is implemented through the InLong Sort module. ### Logical Architecture  ### Key Competency **Job Configuration**: Support Wizard Mode(Configuration through page wizard) and OpenAPI mode. **Scheduling Configuration**: Support Wizard Mode(Configuration through page wizard) and OpenAPI mode **Job Type**: Support Periodic Incremental Synchronization and Periodic Full Synchronization **Scheduling**: Built-in simple periodic scheduling capability, complex capabilities such as task dependencies are supported by third-party scheduling systems. **Data Source:** RMDB, Message Queue and Big data storage(Hive,StarRocks,Iceberg etc.) **Data Sink**: RMDB, Message Queue and Big data storage(Hive,StarRocks,Iceberg etc.) **Compute Engine**: Flink **Offline Job Operation and Maintenance**: Job start,stop and running status monitoring **Special Handling**: Dirty Data Processing Capability ### Data Flow Architecture  1. The user creates an offline synchronization task. 2. The manager saves task information and scheduling information in the DB. 3. After task approval, the offline synchronization task information is encapsulated. 4. Register scheduling information with the scheduling system; InLong has a built-in simple scheduling solution (Quartz), while complete scheduling capabilities rely on third-party scheduling systems (DolphinScheduler, US, etc.). 5. The scheduling system regularly generates scheduling instances. 6. For the initial run, the manager constructs a Flink batch job. 7. Submit the Flink batch job to the Flink cluster. ### Task list ## new dev branch Since this is a big feature for InLong, so, create a new branch for development, and after development and testing are completed, merge it back to master. - [ ] create new dev branch ## Manager Offline Synchronization Task Management: Definition and Management of Offline Synchronization Tasks - [ ] Offline Synchronization task definition - [ ] Offline synchronization task management - [ ] Page wizard mode - [ ] OpenAPI mode Scheduling Management: Scheduling task definition, scheduling instance definition, scheduling task management (CRUD) - [ ] Definition of scheduling information, corresponding to each offline task - [ ] Scheduling information management - [ ] Page wizard mode - [ ] OpenAPI mode - [ ] Support for periodic scheduling capability - [ ] Scheduling instance definition - [ ] Scheduling interface abstraction - [ ] Plugin-based scheduling framework support - [ ] Built-in scheduling capability support (based on Quartz) - [ ] DolphinScheduler, US, etc. Offline Task Submission - [ ] Timing of Flink task submission determined by the scheduling system; submit Flink task when generating scheduling instance Offline Task Operation and Maintenance - [ ] Start (task submission), stop - [ ] Retrieve running status - [ ] Task logs, exceptions ## Sort Flink Task Encapsulation: Add support for Flink environment in batch mode Flink Batch Capability Support - [ ] Support Flink 1.18, upgrade Flink dependencies - [ ] Support Flink 1.18 connectors, connectors support batch mode operation ### InLong Component Other for not specified component ### Are you willing to submit PR? - [X] Yes, I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
