satishkotha opened a new pull request #1396: [HUDI-687] Stop incremental reader 
on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396
 
 
   ## What is the purpose of the pull request
   Example timeline:
   
   t0 -> create bucket1.parquet
   t1 -> create bucket1.log and append updates
   t2 -> request compaction
   t3 -> create bucket2.parquet
   
   If the compaction at t2 takes a long time, incremental reads using 
HoodieParquetInputFormat may progress to the commit at t3 and skip the data 
ingested at t1, leading to apparent 'data loss'. (The data is still on disk, but 
incremental readers won't see it because it is in a log file and the readers 
have already moved on to t3.)
   
   To work around this problem, we want to stop returning data belonging to 
commits > compaction_requested/inprogress_instant. After the compaction is 
complete, the incremental reader would see updates in t2, t3, and so on. The 
disadvantage is that long-running compactions can make it look like the reader 
is 'stuck', but that is better than skipping updates.
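
The cutoff rule described above can be sketched roughly as follows. This is a minimal, standalone illustration in Python, not the code in this PR: `Instant`, `earliest_pending_compaction`, and `readable_instants` are hypothetical names, instant times are simplified to integers (t0 -> 0, t1 -> 1, ...), and the sketch only models which instants are returned, not the file-level read path of HoodieParquetInputFormat.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    """Hypothetical stand-in for a timeline instant (not a Hudi class)."""
    time: int        # increasing instant time (t0 -> 0, t1 -> 1, ...)
    action: str      # "commit", "deltacommit", or "compaction"
    completed: bool  # False while the instant is requested or in progress


def earliest_pending_compaction(timeline: List[Instant]) -> Optional[int]:
    """Return the time of the earliest compaction that has not completed."""
    pending = [i.time for i in timeline
               if i.action == "compaction" and not i.completed]
    return min(pending) if pending else None


def readable_instants(timeline: List[Instant]) -> List[Instant]:
    """Completed instants strictly before the earliest pending compaction."""
    cutoff = earliest_pending_compaction(timeline)
    return [i for i in sorted(timeline, key=lambda i: i.time)
            if i.completed and (cutoff is None or i.time < cutoff)]


# The example timeline from this description, with the compaction still pending.
timeline = [
    Instant(0, "commit", True),        # t0: bucket1.parquet
    Instant(1, "deltacommit", True),   # t1: bucket1.log
    Instant(2, "compaction", False),   # t2: compaction requested
    Instant(3, "commit", True),        # t3: bucket2.parquet
]
print([i.time for i in readable_instants(timeline)])  # [0, 1]

# Once the compaction at t2 completes, the reader can move past it.
after_compaction = [
    Instant(i.time, i.action, True) for i in timeline
]
print([i.time for i in readable_instants(after_compaction)])  # [0, 1, 2, 3]
```

While the compaction at t2 is pending, the reader never advances to t3, so it cannot skip the updates written at t1; this is also why a slow compaction makes the reader appear 'stuck'.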
   
   ## Brief change log
   
   - Change HoodieParquetInputFormat to read only commits prior to the pending compaction instant
   - Add unit tests to validate the behavior
   - Fix broken test utils for reading records
   
   ## Verify this pull request
   This change adds tests and can be verified as follows:
   mvn test (TestMergeOnReadTable and TestHoodieActiveTimeline)
   
   Earlier discussion is on https://github.com/apache/incubator-hudi/pull/1389. 
Sorry, I messed up the rebase, so I am resending this as a new PR to avoid confusion.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
