Re: [PR] [HUDI-6979][RFC-76] support event time based compaction strategy [hudi]

via GitHub Mon, 11 Dec 2023 18:54:38 -0800


waitingF commented on code in PR #10266:
URL: https://github.com/apache/hudi/pull/10266#discussion_r1423345721



##########
rfc/rfc-76/rfc-76.md:
##########
@@ -0,0 +1,238 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-[74]: [support EventTimeBasedCompactionStrategy]
+
+## Proposers
+
+- @waitingF
+
+## Approvers
+ - @<approver1 github username>
+ - @<approver2 github username>
+
+## Status
+
+JIRA: [HUDI-6979](https://issues.apache.org/jira/browse/HUDI-6979)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Currently, to gain low ingestion latency, we can adopt the MergeOnRead table, which support appending log files and 
+compact log files into base file later. When querying the snapshot table (RT table) generated by MOR, 
+query side have to perform a compaction so that they can get all data, which is expected time-consuming causing query latency.
+At the time, hudi provide read-optimized table (RO table) for low query latency just like COW.
+
+But currently, there is no compaction strategy based on event time, so there is no data freshness guarantee for RO table.
+For cases, user want all data before a specified time, user have to query the RT table to get all data with expected high query latency.

Review Comment:
   > With our new file slicing under unbounded io compaction strategy, a compaction plan at t is designated as including all the log files complete before t, does that make sense to your use case? You can then query the ro table after the compaction completes, the ro table data freshness is at least up to t.
   
   I don't think so. File groups that are in pending compaction are skipped when a new compaction plan is scheduled, and those file groups may still hold historical data in log files that have not been compacted yet. In that case the rule that "the ro table data freshness is at least up to `t`" is broken.
   We should ensure that all log files before `t` are compacted. That means we should generate a new plan only when no file group is in pending compaction or clustering, i.e. when no pending compaction or clustering is left. For this, we can introduce a new trigger.
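   
   A minimal sketch of the check such a trigger could perform, assuming we can list the file groups together with their pending state at scheduling time. `PendingFreeTriggerSketch`, `FileGroupState` and `canScheduleNewPlan` are hypothetical names used only for illustration; they are not existing Hudi classes or APIs.
   
   ```java
   import java.util.Arrays;
   import java.util.List;

   // Illustrative only: hypothetical stand-ins, not Hudi classes.
   class PendingFreeTriggerSketch {

       /** Hypothetical view of one file group at scheduling time. */
       static final class FileGroupState {
           final String fileGroupId;
           final boolean pendingCompaction;
           final boolean pendingClustering;

           FileGroupState(String fileGroupId, boolean pendingCompaction, boolean pendingClustering) {
               this.fileGroupId = fileGroupId;
               this.pendingCompaction = pendingCompaction;
               this.pendingClustering = pendingClustering;
           }
       }

       /**
        * Returns true only when no file group is in pending compaction or
        * clustering, so that a plan scheduled at t is not forced to skip any
        * file group that may still hold log files completed before t.
        */
       static boolean canScheduleNewPlan(List<FileGroupState> fileGroups) {
           for (FileGroupState fg : fileGroups) {
               if (fg.pendingCompaction || fg.pendingClustering) {
                   return false;
               }
           }
           return true;
       }

       public static void main(String[] args) {
           List<FileGroupState> allIdle = Arrays.asList(
               new FileGroupState("fg-1", false, false),
               new FileGroupState("fg-2", false, false));
           List<FileGroupState> onePending = Arrays.asList(
               new FileGroupState("fg-1", false, false),
               new FileGroupState("fg-2", true, false));

           System.out.println(canScheduleNewPlan(allIdle));    // prints true
           System.out.println(canScheduleNewPlan(onePending)); // prints false
       }
   }
   ```
   
   With a check like this, scheduling waits until earlier plans have finished instead of silently skipping the file groups they cover.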
   
   > The question is how to tell the reader the freshness of the ro table?
   
   Yeah, this is part of the RFC. We can extract the freshness from the log files during compaction.



##########
rfc/rfc-76/rfc-76.md:
##########
@@ -0,0 +1,238 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-[74]: [support EventTimeBasedCompactionStrategy]
+
+## Proposers
+
+- @waitingF
+
+## Approvers
+ - @<approver1 github username>
+ - @<approver2 github username>
+
+## Status
+
+JIRA: [HUDI-6979](https://issues.apache.org/jira/browse/HUDI-6979)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Currently, to gain low ingestion latency, we can adopt the MergeOnRead table, which support appending log files and 

Review Comment:
   yeah, looks like so



