[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-18 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-630081895


   I am looking on it now. thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-18 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-630035754


   About the manual file clean up, we could move the template file to a temp 
folder.
   And use a special `RecovrableWriter` that could know the temp folder. WDYT?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-18 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-630028998


   NP



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-17 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-629923445


   HI @aljoscha ,
   1. First of all, I think you are right we should merge to master after all 
the participants agree with the pr.
   2. I could split the commit to two commits: Update 
BucketStateSerierlizerTest and Rework. I think it is a good practice. Before do 
it I want to agree on how to change it. Using the `Bucket` not 
`InProgressFileWriter` or `RecoverableFsDataOutputStream` to produce 
pending-file/in-progress-file. The other two main concerns are the following. 
   3. I would want to share some thoughts about using a special 
`RcoverableWriter` tests the buck state migration. Currently, 
`Bucket`/`BucketState`/`BucketStateSerializer` depends on interface, not some 
special implementation. IMO any implementation of `RecoverableWriter` is a 
‘special’ one. The code path we want to test is bucket 
state(pendingFile) serializer(snapshot) and 
deserializer(restore). So I think it is ok if the implementation could test the 
code path. 
   4. For custom file copying/cleanup things. I think it is important that 
these PendingFile/InProgressFileWriter that are restored from v1 or v2 bytes 
should be work really. That the old version of `BucketStateSerierlizerTest` 
also verifies this. It’s why we do that.
   
   Of course, this is all my personal thoughts. Correct if I missing something.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-17 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-629779232


   Hi, @aljoscha 
   I could not restore bucket state through reading the snapshot of v1 in my 
local. I think one reason is that these snapshots contain absolute paths. such 
as
   
   > Expected 
:/Users/maguowei/project/flink.guowei/flink-streaming-java/src/test/resources/bucket-state-migration-test/only-in-progress-v2/bucket/test-bucket/
   Actual   
:/Users/aljoscha/Dev/flink/flink-streaming-java/src/test/resources/bucket-state-migration-test/only-in-progress-v2/bucket/test-bucket/
   <
   
   So I change the `BucketStateSerializerTest`
   1. `prepareDeserializationEmpty`& `prepareDeserializationOnlyInProgress` & 
`prepareDeserializationFull` & `prepareDeserializationNullInProgress` to 
produce only relative path.  So I think we should use the new version to 
produce the snapshot file.
   2. In order to avoid the test regression. I generated two template 
directories for `testSerializationNullInProgress`&`testSerializationFull` to 
test recovery Each test would copy the template directory to the 
temporary directory for testing. Delete after the temporary directory test is 
complete.
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-16 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-629748020


   HI, @aljoscha @kl0u 
   I resolve all the comments. 
   1. Change the name of `PartFileFactory` to `BucketWriter`. Change the 
`PartFileWriter` to `InProgressFileWriter`.
   2. Update the `BucketStateSerializerTest`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-16 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-629650094


   Hi, @aljoscha thanks for your suggestion.
   1. yes you are right I should delete the v1 code in the test.
   1. The reason I used Operator to generate snapshots at the time is that we 
could use a set of code to generate both v1 and v2 snapshots of `BucketState`.
   1. Otherwise, you need to write twice for how to generate 
`BucketStatev1` and `BucketStateV2`. 
   1. There is a special process also need to write twice. The path in the 
snapshot cannot have an absolute path and needs to be specially processed when 
the snapshot is generated.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-16 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-629637583


   1. Yes. It is `BucketWriter` not `BulkWriter`.
   2. I got what do you mean. I would follow your suggestion. But I would want 
to explain why I want to name it `PartFileFactory` at first. I agree with you 
that `BucketWriter` does a lot of things except creating 
`InProgressFileWriter`. That's is so why I want to call it `PartFileFactory` 
not `PartFileWriterFactory` or `InProgressWriterFactory `.  :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-16 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-629604823


   Hi, @kl0u thanks for your review.
   1. You are right. We could rename the `PartFileWriter` to 
`InProgressFileWriter`.
   2. You are very careful. We should remove the `getRecoverable` from 
`PendingFile`. :-)
   3. Maybe I am missing something but I am a little wonder about the rename 
the `PartFileFacory` to the `BulkWriter`:
   1. The methods in the `BulkWriter` are all about how to handle the part 
file. Such as create/resume InProgressWriter which is responsible for writing 
data to the in-progress part file and recover the pending part file. 
   1. The subtype of `BulkWriter` is also about handling the part file 
differently such as format or commit and so on.
   4. So I think the old `PartFileFactory` is more suitable and we could call 
the `WriterProperties` to `PartFileFactoryProperites`. What do you think? 
Correct me if I misunderstand some thing.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [flink] guoweiM commented on pull request #12132: [FLINK-17593][Connectors/FileSystem] Support arbitrary recovery mechanism for PartFileWriter

2020-05-14 Thread GitBox


guoweiM commented on pull request #12132:
URL: https://github.com/apache/flink/pull/12132#issuecomment-628654263


   Thanks for your reminding. I would update pr latter today.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org