[I] [Umbrella] InLong offline synchronization feature [inlong]

via GitHub Tue, 05 Mar 2024 02:55:45 -0800


aloyszhang opened a new issue, #9779:
URL: https://github.com/apache/inlong/issues/9779


   ### Describe the proposal
   
   ## Motivation 
   Currently, InLong provides real-time data synchronization based on the Flink 
engine, which has the advantage of low latency. Offline data synchronization 
(not supported yet), by contrast, prioritizes throughput and efficiency over 
latency. 
   
   To broaden InLong's usage scenarios, we plan to add offline data 
synchronization. Both modes are implemented on the Flink computing engine: 
real-time synchronization tasks run as Flink streaming jobs, while offline 
synchronization tasks run as Flink batch jobs. This keeps the code for 
real-time and offline synchronization as consistent as possible and reduces 
maintenance costs.
   
   ## Solution
   The offline synchronization feature of InLong data integration provides 
sources and sinks, corresponding to data sources and destinations, and works 
with the scheduling system to synchronize full or incremental data from the 
data source to the data destination.
   
   InLong supports scheduling offline synchronization tasks through the 
scheduling system by setting specific trigger times (year, month, day, hour, 
and minute). 
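
As a rough illustration of minute-granular trigger times, the sketch below computes the next trigger of a fixed-interval schedule. `PeriodicSchedule` and its methods are hypothetical names chosen for this example, not InLong classes, and the fixed-interval model is an assumption; the proposal does not specify how trigger times are represented.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Hypothetical periodic schedule used only for illustration; not an InLong class. */
final class PeriodicSchedule {
    private final LocalDateTime firstTrigger; // year, month, day, hour, minute
    private final Duration interval;          // assumed fixed period between triggers

    PeriodicSchedule(LocalDateTime firstTrigger, Duration interval) {
        if (interval.toMinutes() < 1) {
            throw new IllegalArgumentException("interval must be at least one minute");
        }
        this.firstTrigger = firstTrigger;
        this.interval = interval;
    }

    /** Returns the earliest trigger time strictly after {@code now}. */
    LocalDateTime nextTriggerAfter(LocalDateTime now) {
        if (now.isBefore(firstTrigger)) {
            return firstTrigger;
        }
        // Count the whole periods already elapsed, then step one period further.
        long elapsedMillis = Duration.between(firstTrigger, now).toMillis();
        long completedPeriods = elapsedMillis / interval.toMillis();
        return firstTrigger.plus(interval.multipliedBy(completedPeriods + 1));
    }
}
```

For example, a schedule created with `LocalDateTime.of(2024, 3, 5, 2, 0)` and `Duration.ofHours(1)` yields 05:00 as the next trigger when asked at 04:30 on the same day. The sketch uses `LocalDateTime`, so it ignores time zones and daylight-saving transitions.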
   
   Offline synchronization tasks are created by the Manager (including 
scheduling information), and the specific data synchronization logic is 
implemented through the InLong Sort module.
   
   ### Logical Architecture
   
![image](https://github.com/apache/inlong/assets/48062889/319469ac-c82b-4dfb-b858-5917a1bb6a89)
   
   ### Key Competency
   **Job Configuration**: Support Wizard mode (configuration through a page 
wizard) and OpenAPI mode.
   
   **Scheduling Configuration**: Support Wizard mode (configuration through a 
page wizard) and OpenAPI mode.
   
   **Job Type**: Support Periodic Incremental Synchronization and Periodic 
Full Synchronization.
   
   **Scheduling**: Built-in simple periodic scheduling capability; complex 
capabilities such as task dependencies are supported by third-party scheduling 
systems.
   
   **Data Source**: RDBMS, message queues, and big data 
storage (Hive, StarRocks, Iceberg, etc.)
   
   **Data Sink**: RDBMS, message queues, and big data 
storage (Hive, StarRocks, Iceberg, etc.)
   
   **Compute Engine**: Flink
   
   **Offline Job Operation and Maintenance**: Job start, stop, and running 
status monitoring
   
   **Special Handling**: Dirty Data Processing Capability
   
   ### Data Flow Architecture
   
![image](https://github.com/apache/inlong/assets/48062889/a0c83ac0-8011-4542-b311-ebe2d22dd141)
   1. The user creates an offline synchronization task.
   2. The manager saves task information and scheduling information in the DB.
   3. After task approval, the offline synchronization task information is 
encapsulated.
   4. Register scheduling information with the scheduling system; InLong has a 
built-in simple scheduling solution (Quartz), while complete scheduling 
capabilities rely on third-party scheduling systems (DolphinScheduler, US, 
etc.).
   5. The scheduling system regularly generates scheduling instances.
   6. For the initial run, the manager constructs a Flink batch job.
   7. Submit the Flink batch job to the Flink cluster.
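
The steps above can be sketched as a small in-memory state model. Everything here is hypothetical and for illustration only: `OfflineSyncFlow`, its `Status` values, and the `JobSubmitter` interface are not InLong classes, a map stands in for the DB, and a plain string stands in for the Flink batch job. For simplicity the sketch builds a job description on every scheduling instance, which is an assumption rather than something the proposal states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory model of the data flow; none of these types are InLong classes. */
final class OfflineSyncFlow {
    enum Status { CREATED, APPROVED, SCHEDULED }

    /** Stand-in for the client that submits a job to the Flink cluster (step 7). */
    interface JobSubmitter {
        void submit(String jobDescription);
    }

    private final Map<String, Status> tasks = new LinkedHashMap<>(); // stand-in for the DB
    private final JobSubmitter submitter;

    OfflineSyncFlow(JobSubmitter submitter) {
        this.submitter = submitter;
    }

    /** Steps 1-2: the user creates a task and the manager persists it. */
    void createTask(String taskId) {
        tasks.put(taskId, Status.CREATED);
    }

    /** Step 3: the task is approved. */
    void approve(String taskId) {
        transition(taskId, Status.CREATED, Status.APPROVED);
    }

    /** Step 4: scheduling information is registered with the scheduling system. */
    void registerSchedule(String taskId) {
        transition(taskId, Status.APPROVED, Status.SCHEDULED);
    }

    /** Steps 5-7: invoked per scheduling instance; builds and submits a batch job description. */
    void onScheduleInstance(String taskId, int instanceNo) {
        if (tasks.get(taskId) != Status.SCHEDULED) {
            throw new IllegalStateException("task is not scheduled: " + taskId);
        }
        submitter.submit("flink-batch:" + taskId + "#" + instanceNo);
    }

    Status statusOf(String taskId) {
        return tasks.get(taskId);
    }

    private void transition(String taskId, Status from, Status to) {
        if (tasks.get(taskId) != from) {
            throw new IllegalStateException("expected " + from + " for task " + taskId);
        }
        tasks.put(taskId, to);
    }
}
```

Modeling the order explicitly means a scheduling instance that fires for an unapproved or unregistered task is rejected instead of reaching the cluster.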
   
   
   
   ### Task list
   
   ## new dev branch
   Since this is a big feature for InLong, development will happen on a new 
branch, which will be merged back to master once development and testing are 
complete.
   - [ ] create new dev branch
   ## Manager
   Offline Synchronization Task Management: Definition and Management of 
Offline Synchronization Tasks
   
   - [ ]  Offline Synchronization task definition
   - [ ]  Offline synchronization task management
     - [ ] Page wizard mode
     - [ ]  OpenAPI mode
   
   Scheduling Management: Scheduling task definition, scheduling instance 
definition, scheduling task management (CRUD)
   
   - [ ]  Definition of scheduling information, corresponding to each offline 
task
   - [ ] Scheduling information management
     - [ ]  Page wizard mode
     - [ ]  OpenAPI mode
   - [ ]  Support for periodic scheduling capability
     - [ ] Scheduling instance definition
     - [ ]  Scheduling interface abstraction
     - [ ]  Plugin-based scheduling framework support
       - [ ]  Built-in scheduling capability support (based on Quartz)
       - [ ]  DolphinScheduler, US, etc.
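
One possible shape for the scheduling interface abstraction and plugin-based framework is sketched below. The names `ScheduleEngine`, `InMemoryScheduleEngine`, and `ScheduleEngineRegistry` are assumptions made for this example and do not come from the InLong code base; the in-memory engine only records registrations and does not call Quartz or any third-party scheduler.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical scheduling abstraction; the names are illustrative, not InLong APIs. */
interface ScheduleEngine {
    String name();

    void register(String taskId, long intervalMinutes);

    void unregister(String taskId);

    boolean isRegistered(String taskId);
}

/** Stand-in for a built-in engine: it only records registrations, it does not schedule. */
final class InMemoryScheduleEngine implements ScheduleEngine {
    private final String name;
    private final Map<String, Long> registrations = new HashMap<>();

    InMemoryScheduleEngine(String name) {
        this.name = name;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public void register(String taskId, long intervalMinutes) {
        if (intervalMinutes < 1) {
            throw new IllegalArgumentException("intervalMinutes must be >= 1");
        }
        registrations.put(taskId, intervalMinutes);
    }

    @Override
    public void unregister(String taskId) {
        registrations.remove(taskId);
    }

    @Override
    public boolean isRegistered(String taskId) {
        return registrations.containsKey(taskId);
    }
}

/** Looks up engines by name so that further engines can be plugged in later. */
final class ScheduleEngineRegistry {
    private final Map<String, ScheduleEngine> engines = new HashMap<>();

    void add(ScheduleEngine engine) {
        engines.put(engine.name(), engine);
    }

    ScheduleEngine get(String name) {
        ScheduleEngine engine = engines.get(name);
        if (engine == null) {
            throw new IllegalArgumentException("unknown schedule engine: " + name);
        }
        return engine;
    }
}
```

With this split, the Manager would depend only on `ScheduleEngine`, and a Quartz-backed or DolphinScheduler-backed implementation could be added to the registry without changing callers.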
   
   Offline Task Submission
   
   - [ ] The scheduling system determines when Flink tasks are submitted; a 
Flink task is submitted whenever a scheduling instance is generated
   
   Offline Task Operation and Maintenance
   
   - [ ]  Start (task submission), stop
   - [ ]  Retrieve running status
   - [ ]  Task logs, exceptions
   
   ## Sort
   Flink Task Encapsulation: Add support for Flink environment in batch mode
   
   Flink Batch Capability Support
   
   - [ ]  Support Flink 1.18, upgrade Flink dependencies
   - [ ]  Support Flink 1.18 connectors; connectors support batch mode operation
   
   ### InLong Component
   
   Other for not specified component
   
   ### Are you willing to submit PR?
   
   - [X] Yes, I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
