satishkotha edited a comment on issue #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#issuecomment-598409536

> > if compaction at t2 takes a long time, incremental reads using HoodieParquetInputFormat may make progress to read commits at t3
>
> IIUC, this is because you are incrementally pulling from the parquet-only table? I thought we could already incrementally pull via logs, no? cc @n3nash .. is this really needed, since it will add complexity to the system?
>
> Eventually, I would like incremental query/pull on MOR to be based purely on logs.

Based on the view type, Hudi decides which input format to use (see https://github.com/apache/incubator-hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java#L91 and line 143). For RO views, we use HoodieParquetInputFormat, which does not read log files. For RT views, we use HoodieParquetRealtimeInputFormat, which reads the full file slice, including log files.

In my limited testing, incremental reads on RT views also do not work well (we see duplicates after compaction under some conditions). @bvaradar is working on fixing any broken windows for supporting incremental reads on RT views. We wanted to include this change to support RO views (which are the majority of use cases for us).

I agree with you that this is additional complexity; that is why I added more tests than usual. Other alternatives I can think of:

1) Support incremental reads only for RT views. Incremental reads on RO views can fail or fall back to RT (is this your proposal in the comment above?)
2) Instead of doing incremental reads based on Hoodie commit time, use parquet file creation times. This approach requires substantial changes and would likely break some fundamental assumptions.

Also, at a higher level, I want to discuss adding an additional mode for incremental reads. Today, it is the responsibility of Hudi users to save commit times and use them for the next incremental read.
Can we add a 'kafka consumer' model, where a consumer only specifies its unique id and Hudi tracks read progress (perhaps as part of consolidated metadata)? This would simplify usage and make debugging a lot easier.

fyi, @n3nash is out of office for the next 10 days; @bvaradar can likely share more context. Let me know if you have other suggestions.
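To make the proposal concrete, here is a minimal sketch of the idea. All names here (`ConsumerProgressTracker`, `resumeFrom`, `commitProgress`) are hypothetical, not Hudi APIs: the point is that Hudi would keep a per-consumer-id checkpoint (much like Kafka consumer offsets), instead of each user persisting the last-read commit time themselves.

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalProgressSketch {

    // Hypothetical tracker (not a Hudi API): Hudi-side state mapping a
    // caller-supplied consumer id to the last commit time that consumer
    // has successfully read. In practice this could live in the
    // consolidated metadata rather than an in-memory map.
    static class ConsumerProgressTracker {
        private final Map<String, String> lastReadCommit = new HashMap<>();

        /** Commit time to resume the next incremental read from, or null for a fresh consumer. */
        String resumeFrom(String consumerId) {
            return lastReadCommit.get(consumerId);
        }

        /** Called after a successful incremental read to advance this consumer's checkpoint. */
        void commitProgress(String consumerId, String commitTime) {
            lastReadCommit.put(consumerId, commitTime);
        }
    }

    public static void main(String[] args) {
        ConsumerProgressTracker tracker = new ConsumerProgressTracker();

        // First read: no saved progress, so the consumer starts from the beginning.
        String begin = tracker.resumeFrom("etl-job-1");
        System.out.println("resume from: " + begin); // null => full/initial read

        // After reading up to a commit, Hudi (not the caller) records the checkpoint.
        tracker.commitProgress("etl-job-1", "20200312103000");

        // The next incremental pull resumes where this consumer left off.
        System.out.println("resume from: " + tracker.resumeFrom("etl-job-1"));
    }
}
```

Compared with today's model, the caller no longer threads commit times between runs; it only presents a stable id, and debugging reduces to inspecting each consumer's recorded progress.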
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services