satishkotha edited a comment on issue #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#issuecomment-598409536

> > if compaction at t2 takes a long time, incremental reads using HoodieParquetInputFormat may make progress to read commits at t3
>
> IIUC, this is because you are incrementally pulling from the parquet-only table? I thought we could already incrementally pull via logs, no? cc @n3nash .. is this really needed, since it will add complexity to the system?
>
> Eventually, I would like incremental query/pull on MOR to be based purely on logs.

Based on the view type, Hudi decides which input format to use (see https://github.com/apache/incubator-hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java#L91 and line 143). For RO views, we use HoodieParquetInputFormat, which does not read log files. For RT views, we use HoodieParquetRealtimeInputFormat, which reads the full file slice, including log files.

In my limited testing, incremental reads on RT views also do not work well (we see duplicates after compaction under some conditions). @bvaradar is working on fixing any broken windows for supporting incremental reads on RT views. We wanted to include this change to support RO views (which are the majority of use cases for us).

I agree with you that this is additional complexity; that is why I added more tests than usual. Other alternatives I can think of:

1) Support incremental reads only for RT views. Incremental reads on RO views can fail or fall back to RT (is this your proposal in the comment above?)
2) Instead of doing incremental reads based on Hoodie commit time, use parquet file creation times. This approach requires substantial changes and would likely break some fundamental assumptions.

Also, at a higher level, I want to discuss adding an additional mode for incremental reads. Today, it is the responsibility of Hudi users to save commit times and use them for the next incremental read.
Can we add a 'kafka consumer' model, where a consumer only specifies its unique id and Hudi tracks read progress (perhaps as part of consolidated metadata)? This would simplify usage and make debugging a lot easier.

fyi, @n3nash is out of office for the next 10 days; @bvaradar can likely share more context. Let me know if you have other suggestions.
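To make the proposal concrete, here is a minimal sketch of the idea. All names here (`ConsumerProgressTracker`, `resumeFrom`, `commitProgress`) are hypothetical, not Hudi APIs: the point is that Hudi would keep a per-consumer-id checkpoint (much like Kafka consumer offsets), instead of each user persisting the last-read commit time themselves.

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalProgressSketch {

    // Hypothetical tracker (not a Hudi API): Hudi-side state mapping a
    // caller-supplied consumer id to the last commit time that consumer
    // has successfully read. In practice this could live in the
    // consolidated metadata rather than an in-memory map.
    static class ConsumerProgressTracker {
        private final Map<String, String> lastReadCommit = new HashMap<>();

        /** Commit time to resume the next incremental read from, or null for a fresh consumer. */
        String resumeFrom(String consumerId) {
            return lastReadCommit.get(consumerId);
        }

        /** Called after a successful incremental read to advance this consumer's checkpoint. */
        void commitProgress(String consumerId, String commitTime) {
            lastReadCommit.put(consumerId, commitTime);
        }
    }

    public static void main(String[] args) {
        ConsumerProgressTracker tracker = new ConsumerProgressTracker();

        // First read: no saved progress, so the consumer starts from the beginning.
        String begin = tracker.resumeFrom("etl-job-1");
        System.out.println("resume from: " + begin); // null => full/initial read

        // After reading up to a commit, Hudi (not the caller) records the checkpoint.
        tracker.commitProgress("etl-job-1", "20200312103000");

        // The next incremental pull resumes where this consumer left off.
        System.out.println("resume from: " + tracker.resumeFrom("etl-job-1"));
    }
}
```

Compared with today's model, the caller no longer threads commit times between runs; it only presents a stable id, and debugging reduces to inspecting each consumer's recorded progress.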
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services