What Shammon mentioned is interesting. I agree with what he said about the differences in savepoints between databases and stream computing.
About "Paimon's savepoint need to be combined with Flink's savepoint":

I think it is possible, but we may need to handle this with a separate
mechanism, because the snapshots after a savepoint may expire. We would
need to compare data between two savepoints to generate incremental
data for streaming read.

But this need not block the PIP; it looks like the current design does
not rule out the future combination?

Best,
Jingsong

On Wed, May 17, 2023 at 5:33 PM Shammon FY <[email protected]> wrote:
>
> Hi Caizhi,
>
> Thanks for your comments. As you mentioned, I think we may need to discuss
> the role of savepoint in Paimon.
>
> If I understand correctly, the main feature of savepoint in the current PIP
> is that the savepoint will not be expired, and users can perform a query on
> the savepoint according to time-travel. Besides that, there is savepoint in
> the database and Flink.
>
> 1. Savepoint in database. The database can roll back table data to the
> specified 'version' based on savepoint. So the key point of savepoint in
> the database is to rollback data.
>
> 2. Savepoint in Flink. Users can trigger a savepoint with a specific
> 'path', and save all data of state to the savepoint for job. Then users can
> create a new job based on the savepoint to continue consuming incremental
> data. I think the core capabilities are: backup for a job, and resume a job
> based on the savepoint.
>
> In addition to the above, Paimon may also face data write corruption and
> need to recover data based on the specified savepoint. So we may need to
> consider what abilities should Paimon savepoint need besides the ones
> mentioned in the current PIP?
>
> Additionally, as mentioned above, Flink also has
> savepoint mechanism. During the process of streaming data from Flink to
> Paimon, does Paimon's savepoint need to be combined with Flink's savepoint?
>
>
> Best,
> Shammon FY
>
>
> On Wed, May 17, 2023 at 4:02 PM Caizhi Weng <[email protected]> wrote:
>
> > Hi developers!
> >
> > Thanks Zelin for bringing up the discussion. The proposal seems good to me
> > overall. However I'd also like to bring up a few options.
> >
> > 1. As Jingsong mentioned, Savepoint class should not become a public API,
> > at least for now. What we need to discuss for the public API is how the
> > users can create or delete savepoints. For example, what the table option
> > looks like, what commands and options are provided for the Flink action,
> > etc.
> >
> > 2. Currently most Flink actions are related to streaming processing, so
> > only Flink can support them. However, savepoint creation and deletion seems
> > like a feature for batch processing. So aside from Flink actions, shall we
> > also provide something like Spark actions for savepoints?
> >
> > I would also like to comment on Shammon's views.
> >
> > > Should we introduce an option for savepoint path which may be different
> > > from 'warehouse'? Then users can backup the data of savepoint.
> >
> > I don't see this is necessary. To backup a table the user just need to copy
> > all files from the table directory. Savepoint in Paimon, as far as I
> > understand, is mainly for users to review historical data, not for backing
> > up tables.
> >
> > > Will the savepoint copy data files from snapshot or only save meta files?
> >
> > It would be a heavy burden if a savepoint copies all its files. As I
> > mentioned above, savepoint is not for backing up tables.
> >
> > > How can users create a new table and restore data from the specified
> > > savepoint?
> >
> > This reminds me of savepoints in Flink. Still, savepoint is not for backing
> > up tables so I guess we don't need to support "restoring data" from a
> > savepoint.
> >
> > Shammon FY <[email protected]> wrote on Wed, May 17, 2023 at 10:32:
> >
> > > Thanks Zelin for initiating this discussion. I have some comments:
> > >
> > > 1. Should we introduce an option for savepoint path which may be
> > > different from 'warehouse'? Then users can backup the data of savepoint.
> > >
> > > 2. Will the savepoint copy data files from snapshot or only save meta
> > > files? The description in the PIP "After we introduce savepoint, we
> > > should also check if the data files are used by savepoints." looks like
> > > we only save meta files for savepoint.
> > >
> > > 3. How can users create a new table and restore data from the specified
> > > savepoint?
> > >
> > > Best,
> > > Shammon FY
> > >
> > >
> > > On Wed, May 17, 2023 at 10:19 AM Jingsong Li <[email protected]> wrote:
> > >
> > > > Thanks Zelin for driving.
> > > >
> > > > Some comments:
> > > >
> > > > 1. I think it's possible to advance `Proposed Changes` to the top,
> > > > Public API has no meaning if I don't know how to do it.
> > > >
> > > > 2. Public API, Savepoint and SavepointManager are not Public API, only
> > > > Flink action or configuration option should be public API.
> > > >
> > > > 3. Maybe we can have a separate chapter to describe
> > > > `savepoint.create-interval`, maybe 'Periodically savepoint'? It is not
> > > > just an interval, because the true user case is savepoint after 0:00.
> > > >
> > > > 4. About 'Interaction with Snapshot', to be continued ...
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Tue, May 16, 2023 at 7:07 PM yu zelin <[email protected]> wrote:
> > > >
> > > > > Hi, Paimon Devs,
> > > > > I’d like to start a discussion about PIP-4[1]. In this PIP, I want
> > > > > to talk about why we need savepoint, and some thoughts about managing
> > > > > and using savepoint. Look forward to your question and suggestions.
> > > > >
> > > > > Best,
> > > > > Yu Zelin
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/x/NxE0Dw
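
To make two points from the thread concrete, here is a minimal sketch. It is
not Paimon code: the FileSet class and both function names are hypothetical,
and it assumes a savepoint stores only metadata that references data files,
as the PIP sentence quoted by Shammon suggests. The first function shows the
check "are these data files still used by savepoints?" during snapshot
expiration. The second shows a file-level comparison between two savepoints,
a simplified stand-in for the incremental data Jingsong describes (a real
streaming read would presumably work at the record level).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSet:
    """Hypothetical stand-in for a snapshot or savepoint:
    a name plus the set of data files its metadata references."""
    name: str
    data_files: frozenset


def deletable_files(expiring, retained_snapshots, savepoints):
    """Files referenced by expiring snapshots that no retained
    snapshot and no savepoint still references."""
    still_used = set()
    for ref in [*retained_snapshots, *savepoints]:
        still_used |= ref.data_files
    candidates = set()
    for snap in expiring:
        candidates |= snap.data_files
    return candidates - still_used


def incremental_files(older, newer):
    """File-level difference between two savepoints:
    (files added since `older`, files removed since `older`)."""
    added = newer.data_files - older.data_files
    removed = older.data_files - newer.data_files
    return added, removed


# Assumed example data: savepoint-1 was taken at snapshot-1.
s1 = FileSet("snapshot-1", frozenset({"a", "b"}))
s2 = FileSet("snapshot-2", frozenset({"b", "c"}))
s3 = FileSet("snapshot-3", frozenset({"d", "e"}))
sp = FileSet("savepoint-1", frozenset({"a", "b"}))

# Expiring snapshot-1 and snapshot-2 while savepoint-1 exists:
print(sorted(deletable_files([s1, s2], [s3], [sp])))  # ['c']
```

Under these assumptions the savepoint keeps files "a" and "b" alive after
snapshot-1 expires, and the intermediate snapshot-2 is gone, which is why
incremental data would have to be derived by comparing savepoints rather
than by replaying snapshots.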
