ssdong edited a comment on issue #2707: URL: https://github.com/apache/hudi/issues/2707#issuecomment-811619487
@jsbali To give some extra insight and detail, as @zherenyu831 posted in the beginning:

```
[20210323080718__replacecommit__COMPLETED]: size : 0
[20210323081449__replacecommit__COMPLETED]: size : 1
[20210323082046__replacecommit__COMPLETED]: size : 1
[20210323082758__replacecommit__COMPLETED]: size : 1
[20210323084004__replacecommit__COMPLETED]: size : 1
[20210323085044__replacecommit__COMPLETED]: size : 1
[20210323085823__replacecommit__COMPLETED]: size : 1
[20210323090550__replacecommit__COMPLETED]: size : 1
[20210323091700__replacecommit__COMPLETED]: size : 1
```

If we keep everything the same and let the archive logic handle everything, it fails on the empty `partitionToReplaceFileIds` of `20210323080718__replacecommit__COMPLETED` (the first item in the list above); this is a known issue. To make the archive work, we tried to _manually_ delete that first _empty_ commit file. The archive then succeeded, but the job instead failed with `User class threw exception: org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://xxx/data/.hoodie/20210323081449.replacecommit` (the second item in the list above).

Now, to reason through the underlying mechanism of this error: since the archive was successful, several commit files were moved into the `.archive` folder, say

```
[20210323081449__replacecommit__COMPLETED]: size : 1
[20210323082046__replacecommit__COMPLETED]: size : 1
[20210323082758__replacecommit__COMPLETED]: size : 1
[20210323084004__replacecommit__COMPLETED]: size : 1
[20210323085044__replacecommit__COMPLETED]: size : 1
```

have been successfully moved and placed in `.archive`.
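As a side note, the problematic empty instant can be spotted without eyeballing the folder by hand. Below is a minimal, hypothetical sketch (plain JDK file APIs, not Hudi APIs; the class name is made up) that lists zero-byte `*.replacecommit` files under a `.hoodie` directory, using file size as a simple proxy for the `size : 0` entry shown above:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: find completed replacecommit files in a .hoodie
// folder that have no content (0 bytes), i.e. the ones the archiver
// chokes on before they are manually deleted.
public class EmptyReplaceCommitScanner {

  public static List<Path> findEmptyReplaceCommits(Path hoodieDir) throws IOException {
    List<Path> empty = new ArrayList<>();
    try (DirectoryStream<Path> stream =
             Files.newDirectoryStream(hoodieDir, "*.replacecommit")) {
      for (Path p : stream) {
        if (Files.size(p) == 0) {
          empty.add(p);
        }
      }
    }
    return empty;
  }

  public static void main(String[] args) throws IOException {
    // Demo against a throwaway directory that mimics .hoodie contents.
    Path dir = Files.createTempDirectory("hoodie-demo");
    Files.createFile(dir.resolve("20210323080718.replacecommit")); // empty, like the first instant
    Files.write(dir.resolve("20210323081449.replacecommit"),
        "{\"partitionToReplaceFileIds\":{}}".getBytes());
    System.out.println(findEmptyReplaceCommits(dir)); // only the zero-byte file
  }
}
```

This obviously only flags byte-empty files; an instant whose metadata parses but carries an empty `partitionToReplaceFileIds` map would need actual deserialization to detect.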
At this moment, the timeline on disk has been updated and there are 3 remaining commit files:

```
[20210323085823__replacecommit__COMPLETED]: size : 1
[20210323090550__replacecommit__COMPLETED]: size : 1
[20210323091700__replacecommit__COMPLETED]: size : 1
```

Now, pay attention to the stack trace that produced `User class threw exception: org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://xxx/data/.hoodie/20210323081449.replacecommit`; pasting it again:

```
User class threw exception: org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://xxx/data/.hoodie/20210323081449.replacecommit
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:530)
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:194)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$resetFileGroupsReplaced$8(AbstractTableFileSystemView.java:217)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.resetFileGroupsReplaced(AbstractTableFileSystemView.java:228)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:106)
	at org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.reset(AbstractTableFileSystemView.java:248)
	at org.apache.hudi.common.table.view.HoodieTableFileSystemView.close(HoodieTableFileSystemView.java:353)
	at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707)
	at org.apache.hudi.common.table.view.FileSystemViewManager.close(FileSystemViewManager.java:118)
	at org.apache.hudi.timeline.service.TimelineService.close(TimelineService.java:179)
	at org.apache.hudi.client.embedded.EmbeddedTimelineService.stop(EmbeddedTimelineService.java:112)
```

After a `close` is triggered on `TimelineService` (which is understandable), it propagates to `HoodieTableFileSystemView.close`, and right after that we see

```
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:106)
	at org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.reset(AbstractTableFileSystemView.java:248)
```

happening. Now, I am not exactly sure why we need an `init` after `close` is called on the `HoodieTableFileSystemView` (probably someone with deeper knowledge can answer that). If you look at the source code, `reset` and `init` are _initializing with a new Hoodie timeline_:

```java
@Override
public final void reset() {
  try {
    writeLock.lock();
    addedPartitions.clear();
    resetViewState();
    bootstrapIndex = null;
    // Initialize with new Hoodie timeline.
    init(metaClient, getTimeline());
  } finally {
    writeLock.unlock();
  }
}
```

However, this `getTimeline()` _didn't_ really fetch a _new_ timeline, since the `TimelineService` has already been closed, and `public void sync()`, which replaces the old timeline with the new one, is obviously not triggered. So the Hudi table view's in-memory timeline remains the very _old_ timeline, i.e. the one from _before_ the archive.
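To make the staleness argument concrete, here is a toy model (purely illustrative, not actual Hudi code; every class, field, and method name here is made up) of how a `reset()`-style re-initialization that reuses a cached timeline, instead of rescanning the filesystem, ends up trying to read an instant that the archiver has already moved away:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the stale-timeline hazard described above: the view's
// "timeline" is a cached snapshot, and nothing refreshes it after the
// archiver relocates commit files, so re-init reads a vanished instant.
public class StaleTimelineDemo {

  static List<String> onDisk = new ArrayList<>();  // stands in for .hoodie contents
  static List<String> cachedTimeline;              // stands in for the in-memory timeline

  // Analogous to getTimeline(): returns the cache, performs no rescan.
  static List<String> getTimeline() {
    return cachedTimeline;
  }

  // Analogous to what sync() would do: rebuild the cache from disk.
  static List<String> syncTimeline() {
    cachedTimeline = new ArrayList<>(onDisk);
    return cachedTimeline;
  }

  // Analogous to getInstantDetails(): fails if the file is gone.
  static String readInstantDetails(String instant) {
    if (!onDisk.contains(instant)) {
      throw new IllegalStateException("Could not read commit details from " + instant);
    }
    return "details:" + instant;
  }

  public static void main(String[] args) {
    onDisk.add("20210323081449.replacecommit");
    onDisk.add("20210323085823.replacecommit");
    syncTimeline();

    // The archiver moves the first instant to .archive...
    onDisk.remove("20210323081449.replacecommit");

    // ...then close() triggers reset() -> init(metaClient, getTimeline()),
    // which still walks the pre-archive timeline:
    for (String instant : getTimeline()) {
      try {
        readInstantDetails(instant);
      } catch (IllegalStateException e) {
        System.out.println(e.getMessage());
      }
    }
  }
}
```

Under this model, the failure disappears only if something equivalent to `syncTimeline()` runs between the archive and the re-init, which matches the observation that `sync()` is never triggered once the `TimelineService` is closed.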
If it then tries to read those commits from the in-memory timeline and perform the corresponding actions, it will certainly fail, since we have archived those commit files and they now live in the `.archive` folder. It does sound like a paradox: the exception only shows up _after_ we manually delete the commit file to unblock the archive logic. But shouldn't this problem exist from the beginning? Even with a successful archive, the in-memory timeline stays old, given how we close and re-initialize the Hudi table view. Thoughts? 😅