Hi, Shammon,

An intuitive way is to use a numeric string to indicate a snapshot and a
non-numeric string to indicate a tag.
For example:

SELECT * FROM t VERSION AS OF 1            -- travel to snapshot #1
SELECT * FROM t VERSION AS OF 'last_year'  -- travel to tag `last_year`

This is also how Iceberg does it [1].

However, with this approach, a tag name cannot be a numeric string. I think
this is acceptable, and I will add it to the document.
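For illustration, the rule could look like the following sketch (the helper name `resolve_version` is made up here, not actual Paimon code):

```python
def resolve_version(version: str):
    """Map a VERSION AS OF value to a time-travel target.

    Rule from above: a purely numeric string selects a snapshot id;
    any other string selects a tag name.
    """
    if version.isdigit():
        return ("snapshot", int(version))
    return ("tag", version)

# A numeric value travels to a snapshot, a non-numeric one to a tag.
print(resolve_version("1"))          # ('snapshot', 1)
print(resolve_version("last_year"))  # ('tag', 'last_year')
```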

Best,
Yu Zelin

[1] https://iceberg.apache.org/docs/latest/spark-queries/#sql

> On May 30, 2023, at 12:17, Shammon FY <[email protected]> wrote:
> 
> Hi zelin,
> 
> Thanks for your update. I have one comment about Time Travel on savepoint.
> 
> Currently we can use statement in spark for specific snapshot 1
> SELECT * FROM t VERSION AS OF 1;
> 
> My point is how we can distinguish between a snapshot and a savepoint when
> users submit a statement as follows:
> SELECT * FROM t VERSION AS OF <version value>;
> 
> Best,
> Shammon FY
> 
> On Tue, May 30, 2023 at 11:37 AM yu zelin <[email protected]> wrote:
> 
>> Hi, Jingsong,
>> 
>> Thanks for your feedback.
>> 
>> ## TAG ID
>> It seems the id is useless currently. I’ll remove it.
>> 
>> ## Time Travel Syntax
>> Since tag id is removed, we can just use:
>> 
>> SELECT * FROM t VERSION AS OF 'tag-name'
>> 
>> to travel to a tag.
>> 
>> ## Tag class
>> I agree with you that we can reuse the Snapshot class. We can introduce
>> `TagManager` only to manage tags.
>> 
>> ## Expiring Snapshot
>>> why not record it in ManifestEntry?
>> This is because every time Paimon generates a snapshot, it creates new
>> ManifestEntries for the data files. Consider this scenario: if we record
>> it in ManifestEntry and we commit data file A to snapshot #1, we get
>> manifest entry Entry#1 as [ADD, A, commit at #1]. Then we commit -A to
>> snapshot #2 and get manifest entry Entry#2 as [DELETE, A, ?]. As you can
>> see, we cannot know at which snapshot we committed file A. So we have to
>> record this information in the data file meta directly.
>> 
>>> We should note that "record it in `DataFileMeta`" should be done before
>>> "tag" and document version compatibility.
>> 
>> I will add a note for this.
>> 
>> Best,
>> Yu Zelin
>> 
>> 
>>> On May 29, 2023, at 10:29, Jingsong Li <[email protected]> wrote:
>>> 
>>> Thanks Zelin for the update.
>>> 
>>> ## TAG ID
>>> 
>>> Is this useful? We have tag-name, snapshot-id, and now we are
>>> introducing a tag id? What is it used for?
>>> 
>>> ## Time Travel
>>> 
>>> SELECT * FROM t VERSION AS OF tag-name.<name>
>>> 
>>> This does not look like the SQL standard.
>>> 
>>> Why do we introduce this `tag-name` prefix?
>>> 
>>> ## Tag class
>>> 
>>> Why not just use the Snapshot class? It looks like we don't need to
>>> introduce a Tag class. We can just copy the snapshot file to tag/.
>>> 
>>> ## Expiring Snapshot
>>> 
>>> We should note that "record it in `DataFileMeta`" should be done
>>> before "tag". And document version compatibility.
>>> And why not record it in ManifestEntry?
>>> 
>>> Best,
>>> Jingsong
>>> 
>>> On Fri, May 26, 2023 at 11:15 AM yu zelin <[email protected]> wrote:
>>>> 
>>>> Hi, all,
>>>> 
>>>> FYI, I have updated the PIP [1].
>>>> 
>>>> Main changes:
>>>> - Use new name `tag`
>>>> - Enrich Motivation
>>>> - New section `Data Files Handling` to describe how to determine
>>>> whether a data file can be deleted.
>>>> 
>>>> Best,
>>>> Yu Zelin
>>>> 
>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
>>>> 
>>>>> On May 24, 2023, at 17:18, yu zelin <[email protected]> wrote:
>>>>> 
>>>>> Hi, Guojun,
>>>>> 
>>>>> I’d like to share my thoughts about your questions.
>>>>> 
>>>>> 1. Expiration of savepoint
>>>>> In my opinion, savepoints are created at long intervals, so there will
>>>>> not be too many of them. If users create a savepoint per day, there are
>>>>> 365 savepoints a year. So I didn't consider expiration of them, and I
>>>>> think providing a Flink action like `delete-savepoint id = 1` is enough
>>>>> for now. But if it is really important, we can introduce table options
>>>>> to do so, like expiring snapshots.
>>>>> 
>>>>> 2. >   id of compacted snapshot picked by the savepoint
>>>>> My initial idea was picking a compacted snapshot or doing compaction
>>>>> before creating the savepoint. But after discussing with Jingsong, I
>>>>> found it's difficult. So now I propose to directly create the savepoint
>>>>> from the given snapshot. Maybe we can optimize it later. The changes
>>>>> will be updated soon.
>>>>>> manifest file list in system-table
>>>>> I think the manifest file is not very important for users. Users can
>>>>> find when a savepoint was created and get the savepoint id, then they
>>>>> can query the savepoint by that id. I didn't see what scenario requires
>>>>> the manifest file information. What do you think?
>>>>> 
>>>>> Best,
>>>>> Yu Zelin
>>>>> 
>>>>>> On May 24, 2023, at 10:50, Guojun Li <[email protected]> wrote:
>>>>>> 
>>>>>> Thanks zelin for bringing up the discussion. I'm thinking about:
>>>>>> 1. How to manage the savepoints if there is no expiration mechanism:
>>>>>> by the TTL management of storages or an external script?
>>>>>> 2. I think the id of the compacted snapshot picked by the savepoint
>>>>>> and the manifest file list are also important information for users;
>>>>>> could this information be stored in the system-table?
>>>>>> 
>>>>>> Best,
>>>>>> Guojun
>>>>>> 
>>>>>> On Mon, May 22, 2023 at 9:13 PM Jingsong Li <[email protected]>
>> wrote:
>>>>>> 
>>>>>>> FYI
>>>>>>> 
>>>>>>> The PIP lacks a table to show Discussion thread & Vote thread &
>> ISSUE...
>>>>>>> 
>>>>>>> Best
>>>>>>> Jingsong
>>>>>>> 
>>>>>>> On Mon, May 22, 2023 at 4:48 PM yu zelin <[email protected]>
>> wrote:
>>>>>>>> 
>>>>>>>> Hi, all,
>>>>>>>> 
>>>>>>>> Thank all of you for your suggestions and questions. After reading
>>>>>>>> your suggestions, I adopted some of them, and I want to share my
>>>>>>>> opinions here.
>>>>>>>> 
>>>>>>>> To make my statements clearer, I will still use the word
>>>>>>>> `savepoint`. When we reach a consensus, the name may be changed.
>>>>>>>> 
>>>>>>>> 1. The purposes of savepoint
>>>>>>>> 
>>>>>>>> As Shammon mentioned, Flink and databases also have the concept of
>>>>>>>> `savepoint`, so it's better to clarify the purposes of our
>>>>>>>> savepoint. Thanks to Nicholas and Jingsong; I think your
>>>>>>>> explanations are very clear. I'd like to give my summary:
>>>>>>>> 
>>>>>>>> (1) Fault recovery (or we can say disaster recovery). Users can ROLL
>>>>>>>> BACK to a savepoint if needed. If a user rolls back to a savepoint,
>>>>>>>> the table will hold the data in the savepoint, and the data committed
>>>>>>>> after the savepoint will be deleted. In this scenario we need the
>>>>>>>> savepoint because snapshots may have expired; the savepoint can be
>>>>>>>> kept longer and saves the user's old data.
>>>>>>>> 
>>>>>>>> (2) Record versions of data at a longer interval (typically daily
>>>>>>>> or weekly level). With a savepoint, users can query the old data in
>>>>>>>> batch mode. Compared to copying records to a new table or merging
>>>>>>>> incremental records with old records (like using MERGE INTO in
>>>>>>>> Hive), the savepoint is more lightweight because we don't copy data
>>>>>>>> files; we just record their metadata.
>>>>>>>> 
>>>>>>>> As you can see, a savepoint is very similar to a snapshot. The
>>>>>>>> differences are:
>>>>>>>> 
>>>>>>>> (1) A savepoint lives longer. In most cases, a snapshot's lifetime
>>>>>>>> is about several minutes to hours. We expect a savepoint to live for
>>>>>>>> several days, weeks, or even months.
>>>>>>>> 
>>>>>>>> (2) A savepoint is mainly used for batch reading of historical
>>>>>>>> data. In this PIP, we don't introduce streaming reading for
>>>>>>>> savepoints.
>>>>>>>> 
>>>>>>>> 2. Candidates of name
>>>>>>>> 
>>>>>>>> I agree with Jingsong that we can use a new name. Since the purpose
>>>>>>>> and mechanism of savepoint (it is very similar to snapshot) are
>>>>>>>> similar to `tag` in Iceberg, maybe we can use `tag`.
>>>>>>>> 
>>>>>>>> In my opinion, an alternative is `anchor`. All the snapshots are
>>>>>>>> like the navigation path of the streaming data, and an `anchor` can
>>>>>>>> stop it in place.
>>>>>>>> 
>>>>>>>> 3. Public table operations and options
>>>>>>>> 
>>>>>>>> We propose to expose some operations and table options for users to
>>>>>>>> manage savepoints.
>>>>>>>> 
>>>>>>>> (1) Operations (currently for Flink)
>>>>>>>> We provide Flink actions to manage savepoints:
>>>>>>>> create-savepoint: generates a savepoint from the latest snapshot;
>>>>>>>> creating from a specified snapshot is also supported.
>>>>>>>> delete-savepoint: deletes the specified savepoint.
>>>>>>>> rollback-to: rolls back to a specified savepoint.
>>>>>>>> 
>>>>>>>> (2) Table options
>>>>>>>> We propose to provide options for creating savepoints periodically:
>>>>>>>> savepoint.create-time: when to create the savepoint. Example: 00:00.
>>>>>>>> savepoint.create-interval: the interval between the creation of two
>>>>>>>> savepoints. Example: 2 d.
>>>>>>>> savepoint.time-retained: the maximum time to retain savepoints.
>>>>>>>> 
>>>>>>>> (3) Procedures (future work)
>>>>>>>> Spark supports SQL extensions. After we support the Spark CALL
>>>>>>>> statement, we can provide procedures to create, delete, or roll back
>>>>>>>> to a savepoint for Spark users.
>>>>>>>>
>>>>>>>> Support of CALL is on the roadmap of Flink. In a future version, we
>>>>>>>> can also support savepoint-related procedures for Flink users.
>>>>>>>> 
>>>>>>>> 4. Expiration of data files
>>>>>>>> 
>>>>>>>> Currently, when a snapshot expires, data files that are not used by
>>>>>>>> other snapshots will be deleted. After we introduce the savepoint,
>>>>>>>> we must make sure the data files saved by a savepoint will not be
>>>>>>>> deleted.
>>>>>>>> 
>>>>>>>> Conversely, when a savepoint is deleted, the data files that are not
>>>>>>>> used by existing snapshots and other savepoints will be deleted.
>>>>>>>> 
>>>>>>>> I have written some POC code to implement it. I will update the
>>>>>>>> mechanism in the PIP soon.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Yu Zelin
>>>>>>>> 
>>>>>>>>> On May 21, 2023, at 20:54, Jingsong Li <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Thanks Yun for your information.
>>>>>>>>> 
>>>>>>>>> We need to be careful to avoid confusion between the Paimon and
>>>>>>>>> Flink concepts of "savepoint".
>>>>>>>>> 
>>>>>>>>> Maybe we don't have to insist on using "savepoint"; for example,
>>>>>>>>> TAG is also a candidate, just like in Iceberg [1].
>>>>>>>>> 
>>>>>>>>> [1] https://iceberg.apache.org/docs/latest/branching/
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> Jingsong
>>>>>>>>> 
>>>>>>>>> On Sun, May 21, 2023 at 8:51 PM Jingsong Li <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks Nicholas for your detailed requirements.
>>>>>>>>>> 
>>>>>>>>>> We need to supplement user requirements in the FLIP, which is
>>>>>>>>>> mainly aimed at two purposes:
>>>>>>>>>> 1. Fault recovery for data errors (named: restore or rollback-to)
>>>>>>>>>> 2. Recording versions at the day level, targeting batch queries
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Jingsong
>>>>>>>>>> 
>>>>>>>>>> On Sat, May 20, 2023 at 2:55 PM Yun Tang <[email protected]>
>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Guys,
>>>>>>>>>>> 
>>>>>>>>>>> Since we use Paimon with Flink in most cases, I think we need to
>>>>>>>>>>> clarify how the same word "savepoint" is used in different
>>>>>>>>>>> systems.
>>>>>>>>>>> 
>>>>>>>>>>> For Flink, savepoint means:
>>>>>>>>>>> 
>>>>>>>>>>> 1. Triggered by users, not periodically triggered by the system
>>>>>>>>>>> itself. However, this FLIP wants to support creating it
>>>>>>>>>>> periodically.
>>>>>>>>>>> 2. Even the so-called incremental native savepoint [1] will not
>>>>>>>>>>> depend on previous checkpoints or savepoints; it will still copy
>>>>>>>>>>> files on DFS to the self-contained savepoint folder. However,
>>>>>>>>>>> from the description of this FLIP about the deletion of expired
>>>>>>>>>>> snapshot files, a Paimon savepoint will refer to the previously
>>>>>>>>>>> existing files directly.
>>>>>>>>>>> 
>>>>>>>>>>> I don't think we need to make the semantics of Paimon totally the
>>>>>>>>>>> same as Flink's. However, we need to introduce a table to show the
>>>>>>>>>>> differences compared with Flink and discuss them.
>>>>>>>>>>> 
>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Semantic
>>>>>>>>>>> 
>>>>>>>>>>> Best
>>>>>>>>>>> Yun Tang
>>>>>>>>>>> ________________________________
>>>>>>>>>>> From: Nicholas Jiang <[email protected]>
>>>>>>>>>>> Sent: Friday, May 19, 2023 17:40
>>>>>>>>>>> To: [email protected] <[email protected]>
>>>>>>>>>>> Subject: Re: [DISCUSS] PIP-4 Support savepoint
>>>>>>>>>>> 
>>>>>>>>>>> Hi Guys,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks Zelin for driving the savepoint proposal. I propose some
>>>>>>>>>>> opinions for savepoint:
>>>>>>>>>>> 
>>>>>>>>>>> -- About "introduce savepoint for Paimon to persist full data in
>>>>>>>>>>> a time point"
>>>>>>>>>>> 
>>>>>>>>>>> The motivation of the savepoint proposal is more like snapshot
>>>>>>>>>>> TTL management. Actually, disaster recovery is mission critical
>>>>>>>>>>> for any software. Especially when it comes to data systems, the
>>>>>>>>>>> impact could be very serious, leading to delays in business
>>>>>>>>>>> decisions or even wrong business decisions at times. Savepoint is
>>>>>>>>>>> proposed to assist users in recovering data from a previous
>>>>>>>>>>> state: "savepoint" and "restore".
>>>>>>>>>>> 
>>>>>>>>>>> "savepoint" saves the Paimon table as of the commit time;
>>>>>>>>>>> therefore, if there is a savepoint, the data generated in the
>>>>>>>>>>> corresponding commit cannot be cleaned. Meanwhile, a savepoint
>>>>>>>>>>> lets users restore the table to this savepoint at a later point
>>>>>>>>>>> in time if need be. On similar lines, a savepoint cannot be
>>>>>>>>>>> triggered on a commit that is already cleaned up. A savepoint is
>>>>>>>>>>> synonymous with taking a backup, just that we don't make a new
>>>>>>>>>>> copy of the table, but just save the state of the table elegantly
>>>>>>>>>>> so that we can restore it later when in need.
>>>>>>>>>>> 
>>>>>>>>>>> "restore" lets you restore your table to one of the savepoint
>>>>>>>>>>> commits. Meanwhile, it cannot be undone (or reversed), so care
>>>>>>>>>>> should be taken before doing a restore. At this time, Paimon
>>>>>>>>>>> would delete all data files and commit files (timeline files)
>>>>>>>>>>> greater than the savepoint commit to which the table is being
>>>>>>>>>>> restored.
>>>>>>>>>>> 
>>>>>>>>>>> BTW, it's better to introduce a snapshot view based on
>>>>>>>>>>> savepoints, which could improve the query performance of
>>>>>>>>>>> historical data for Paimon tables.
>>>>>>>>>>> 
>>>>>>>>>>> -- About the Public API of savepoint
>>>>>>>>>>> 
>>>>>>>>>>> The currently introduced savepoint interfaces in the Public API
>>>>>>>>>>> are not enough for users; for example, deleteSavepoint,
>>>>>>>>>>> restoreSavepoint, etc. are missing.
>>>>>>>>>>> 
>>>>>>>>>>> -- About "Paimon's savepoint need to be combined with Flink's
>>>>>>> savepoint":
>>>>>>>>>>> 
>>>>>>>>>>> If Paimon supports a savepoint mechanism and provides savepoint
>>>>>>>>>>> interfaces, the integration with Flink's savepoint is not blocked
>>>>>>>>>>> for this proposal.
>>>>>>>>>>> 
>>>>>>>>>>> In summary, savepoint is not only used to improve the query
>>>>>>>>>>> performance of historical data, but also for disaster recovery.
>>>>>>>>>>> 
>>>>>>>>>>> On 2023/05/17 09:53:11 Jingsong Li wrote:
>>>>>>>>>>>> What Shammon mentioned is interesting. I agree with what he said
>>>>>>> about
>>>>>>>>>>>> the differences in savepoints between databases and stream
>>>>>>> computing.
>>>>>>>>>>>> 
>>>>>>>>>>>> About "Paimon's savepoint need to be combined with Flink's
>>>>>>> savepoint":
>>>>>>>>>>>> 
>>>>>>>>>>>> I think it is possible, but we may need to deal with this in
>> another
>>>>>>>>>>>> mechanism, because the snapshots after savepoint may expire. We
>> need
>>>>>>>>>>>> to compare data between two savepoints to generate incremental
>> data
>>>>>>>>>>>> for streaming read.
>>>>>>>>>>>> 
>>>>>>>>>>>> But this may not need to block the FLIP; it looks like the
>>>>>>>>>>>> current design does not break the future combination?
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jingsong
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, May 17, 2023 at 5:33 PM Shammon FY <[email protected]>
>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Caizhi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for your comments. As you mentioned, I think we may
>> need to
>>>>>>> discuss
>>>>>>>>>>>>> the role of savepoint in Paimon.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If I understand correctly, the main feature of savepoint in the
>>>>>>>>>>>>> current PIP is that the savepoint will not expire, and users
>>>>>>>>>>>>> can query the savepoint via time travel. Besides that, there
>>>>>>>>>>>>> are savepoints in databases and in Flink.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. Savepoint in a database. The database can roll back table
>>>>>>>>>>>>> data to a specified 'version' based on a savepoint. So the key
>>>>>>>>>>>>> point of savepoint in the database is to roll back data.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 2. Savepoint in Flink. Users can trigger a savepoint with a
>>>>>>>>>>>>> specific 'path' and save all state data of a job to the
>>>>>>>>>>>>> savepoint. Then users can create a new job based on the
>>>>>>>>>>>>> savepoint to continue consuming incremental data. I think the
>>>>>>>>>>>>> core capabilities are: backing up a job, and resuming a job
>>>>>>>>>>>>> based on the savepoint.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> In addition to the above, Paimon may also face data write
>>>>>>>>>>>>> corruption and need to recover data based on a specified
>>>>>>>>>>>>> savepoint. So we may need to consider what abilities Paimon's
>>>>>>>>>>>>> savepoint needs besides the ones mentioned in the current PIP.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Additionally, as mentioned above, Flink also has a savepoint
>>>>>>>>>>>>> mechanism. During the process of streaming data from Flink to
>>>>>>>>>>>>> Paimon, does Paimon's savepoint need to be combined with
>>>>>>>>>>>>> Flink's savepoint?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Shammon FY
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, May 17, 2023 at 4:02 PM Caizhi Weng <
>> [email protected]>
>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi developers!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks Zelin for bringing up the discussion. The proposal
>>>>>>>>>>>>>> seems good to me overall. However, I'd also like to bring up a
>>>>>>>>>>>>>> few points.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. As Jingsong mentioned, Savepoint class should not become a
>>>>>>> public API,
>>>>>>>>>>>>>> at least for now. What we need to discuss for the public API
>> is
>>>>>>> how the
>>>>>>>>>>>>>> users can create or delete savepoints. For example, what the
>>>>>>> table option
>>>>>>>>>>>>>> looks like, what commands and options are provided for the
>> Flink
>>>>>>> action,
>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. Currently most Flink actions are related to streaming
>>>>>>>>>>>>>> processing, so only Flink can support them. However, savepoint
>>>>>>>>>>>>>> creation and deletion seem like features for batch processing.
>>>>>>>>>>>>>> So aside from Flink actions, shall we also provide something
>>>>>>>>>>>>>> like Spark actions for savepoints?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I would also like to comment on Shammon's views.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Should we introduce an option for savepoint path which may be
>>>>>>> different
>>>>>>>>>>>>>>> from 'warehouse'? Then users can backup the data of
>> savepoint.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I don't see why this is necessary. To back up a table, the
>>>>>>>>>>>>>> user just needs to copy all files from the table directory.
>>>>>>>>>>>>>> Savepoint in Paimon, as far as I understand, is mainly for
>>>>>>>>>>>>>> users to review historical data, not for backing up tables.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Will the savepoint copy data files from snapshot or only save
>>>>>>> meta files?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It would be a heavy burden if a savepoint copies all its
>> files.
>>>>>>> As I
>>>>>>>>>>>>>> mentioned above, savepoint is not for backing up tables.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> How can users create a new table and restore data from the
>>>>>>> specified
>>>>>>>>>>>>>>> savepoint?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This reminds me of savepoints in Flink. Still, savepoint is
>> not
>>>>>>> for backing
>>>>>>>>>>>>>> up tables so I guess we don't need to support "restoring data"
>>>>>>> from a
>>>>>>>>>>>>>> savepoint.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:32, Shammon FY <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks Zelin for initiating this discussion. I have some
>>>>>>> comments:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. Should we introduce an option for savepoint path which
>> may be
>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>> from 'warehouse'? Then users can backup the data of
>> savepoint.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2. Will the savepoint copy data files from snapshot or only
>> save
>>>>>>> meta
>>>>>>>>>>>>>>> files? The description in the PIP "After we introduce
>> savepoint,
>>>>>>> we
>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>> also check if the data files are used by savepoints." looks
>> like
>>>>>>> we only
>>>>>>>>>>>>>>> save meta files for savepoint.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 3. How can users create a new table and restore data from the
>>>>>>> specified
>>>>>>>>>>>>>>> savepoint?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Shammon FY
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, May 17, 2023 at 10:19 AM Jingsong Li <
>>>>>>> [email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks Zelin for driving.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Some comments:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. I think it's possible to move `Proposed Changes` to the
>>>>>>>>>>>>>>>> top; the Public API has no meaning if I don't know how it is
>>>>>>>>>>>>>>>> done.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2. Public API: Savepoint and SavepointManager are not
>>>>>>>>>>>>>>>> public API; only Flink actions or configuration options
>>>>>>>>>>>>>>>> should be public API.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 3. Maybe we can have a separate chapter to describe
>>>>>>>>>>>>>>>> `savepoint.create-interval`, maybe 'Periodic savepoints'? It
>>>>>>>>>>>>>>>> is not just an interval, because the true use case is a
>>>>>>>>>>>>>>>> savepoint after 0:00.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 4. About 'Interaction with Snapshot', to be continued ...
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jingsong
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, May 16, 2023 at 7:07 PM yu zelin <[email protected]> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi, Paimon Devs,
>>>>>>>>>>>>>>>>> I'd like to start a discussion about PIP-4 [1]. In this
>>>>>>>>>>>>>>>>> PIP, I want to talk about why we need savepoint, and share
>>>>>>>>>>>>>>>>> some thoughts about managing and using savepoints. I look
>>>>>>>>>>>>>>>>> forward to your questions and suggestions.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Yu Zelin
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/x/NxE0Dw
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 
>> 
