[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-630081895 I am looking on it now. thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-630035754 About the manual file clean up, we could move the template file to a temp folder. And use a special `RecovrableWriter` that could know the temp folder. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-630028998 NP This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-629923445 HI @aljoscha , 1. First of all, I think you are right we should merge to master after all the participants agree with the pr. 2. I could split the commit to two commits: Update BucketStateSerierlizerTest and Rework. I think it is a good practice. Before do it I want to agree on how to change it. Using the `Bucket` not `InProgressFileWriter` or `RecoverableFsDataOutputStream` to produce pending-file/in-progress-file. The other two main concerns are the following. 3. I would want to share some thoughts about using a special `RcoverableWriter` tests the buck state migration. Currently, `Bucket`/`BucketState`/`BucketStateSerializer` depends on interface, not some special implementation. IMO any implementation of `RecoverableWriter` is a ‘special’ one. The code path we want to test is bucket state(pendingFile) serializer(snapshot) and deserializer(restore). So I think it is ok if the implementation could test the code path. 4. For custom file copying/cleanup things. I think it is important that these PendingFile/InProgressFileWriter that are restored from v1 or v2 bytes should be work really. That the old version of `BucketStateSerierlizerTest` also verifies this. It’s why we do that. Of course, this is all my personal thoughts. Correct if I missing something. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-629779232 Hi, @aljoscha I could not restore bucket state through reading the snapshot of v1 in my local. I think one reason is that these snapshots contain absolute paths. such as > Expected :/Users/maguowei/project/flink.guowei/flink-streaming-java/src/test/resources/bucket-state-migration-test/only-in-progress-v2/bucket/test-bucket/ Actual :/Users/aljoscha/Dev/flink/flink-streaming-java/src/test/resources/bucket-state-migration-test/only-in-progress-v2/bucket/test-bucket/ < So I change the `BucketStateSerializerTest` 1. `prepareDeserializationEmpty`& `prepareDeserializationOnlyInProgress` & `prepareDeserializationFull` & `prepareDeserializationNullInProgress` to produce only relative path. So I think we should use the new version to produce the snapshot file. 2. In order to avoid the test regression. I generated two template directories for `testSerializationNullInProgress`&`testSerializationFull` to test recovery Each test would copy the template directory to the temporary directory for testing. Delete after the temporary directory test is complete. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-629748020 HI, @aljoscha @kl0u I resolve all the comments. 1. Change the name of `PartFileFactory` to `BucketWriter`. Change the `PartFileWriter` to `InProgressFileWriter`. 2. Update the `BucketStateSerializerTest`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-629650094 Hi, @aljoscha thanks for your suggestion. 1. yes you are right I should delete the v1 code in the test. 1. The reason I used Operator to generate snapshots at the time is that we could use a set of code to generate both v1 and v2 snapshots of `BucketState`. 1. Otherwise, you need to write twice for how to generate `BucketStatev1` and `BucketStateV2`. 1. There is a special process also need to write twice. The path in the snapshot cannot have an absolute path and needs to be specially processed when the snapshot is generated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-629637583 1. Yes. It is `BucketWriter` not `BulkWriter`. 2. I got what do you mean. I would follow your suggestion. But I would want to explain why I want to name it `PartFileFactory` at first. I agree with you that `BucketWriter` does a lot of things except creating `InProgressFileWriter`. That's is so why I want to call it `PartFileFactory` not `PartFileWriterFactory` or `InProgressWriterFactory `. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-629604823 Hi, @kl0u thanks for your review. 1. You are right. We could rename the `PartFileWriter` to `InProgressFileWriter`. 2. You are very careful. We should remove the `getRecoverable` from `PendingFile`. :-) 3. Maybe I am missing something but I am a little wonder about the rename the `PartFileFacory` to the `BulkWriter`: 1. The methods in the `BulkWriter` are all about how to handle the part file. Such as create/resume InProgressWriter which is responsible for writing data to the in-progress part file and recover the pending part file. 1. The subtype of `BulkWriter` is also about handling the part file differently such as format or commit and so on. 4. So I think the old `PartFileFactory` is more suitable and we could call the `WriterProperties` to `PartFileFactoryProperites`. What do you think? Correct me if I misunderstand some thing. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter
guoweiM commented on pull request #12132: URL: https://github.com/apache/flink/pull/12132#issuecomment-628654263 Thanks for your reminding. I would update pr latter today. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org