Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-12 Thread via GitHub


vinothchandar merged PR #10255:
URL: https://github.com/apache/hudi/pull/10255


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-12 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1939541288

   
   ## CI report:
   
   * 3153fe03616237ff7814c53923fec2a2504b6d01 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22420)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-12 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1939256594

   
   ## CI report:
   
   * a2cb2403ded21b17b991304af14eb57897b4100d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22357)
 
   * 3153fe03616237ff7814c53923fec2a2504b6d01 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22420)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-12 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1939244338

   
   ## CI report:
   
   * a2cb2403ded21b17b991304af14eb57897b4100d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22357)
 
   * 3153fe03616237ff7814c53923fec2a2504b6d01 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-12 Thread via GitHub


codope commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1439208778


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) {

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-12 Thread via GitHub


vinothchandar commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1486534637


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTime;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed after or on 'startTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-12 Thread via GitHub


vinothchandar commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1486523975


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-07 Thread via GitHub


vinothchandar commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1482177137


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java:
##
@@ -407,4 +407,12 @@ public static List> allOptions(Class 
clazz) {
 }
 return options;
   }
+
+  /**
+   * Whether the reader only consumes new commit instants.
+   */
+  public static boolean isOnlyConsumingNewCommits(Configuration conf) {

Review Comment:
   covered by UT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1932526613

   
   ## CI report:
   
   * a2cb2403ded21b17b991304af14eb57897b4100d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22357)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1932172445

   
   ## CI report:
   
   * 975fbb9de10066ac67a875f3484fe9f5922e197d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22346)
 
   * a2cb2403ded21b17b991304af14eb57897b4100d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22357)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1932157094

   
   ## CI report:
   
   * 975fbb9de10066ac67a875f3484fe9f5922e197d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22346)
 
   * a2cb2403ded21b17b991304af14eb57897b4100d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1931073583

   
   ## CI report:
   
   * 975fbb9de10066ac67a875f3484fe9f5922e197d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22346)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1931021650

   
   ## CI report:
   
   * d2c7c5cd379c4c8ba4297fd5fa1e989f73e58b5d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21854)
 
   * 975fbb9de10066ac67a875f3484fe9f5922e197d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22346)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1931014215

   
   ## CI report:
   
   * d2c7c5cd379c4c8ba4297fd5fa1e989f73e58b5d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21854)
 
   * 975fbb9de10066ac67a875f3484fe9f5922e197d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-02-06 Thread via GitHub


vinothchandar commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1480671223


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTime;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed after or on 'startTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-29 Thread via GitHub


vinothchandar commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1470522333


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'

Review Comment:
   what do we exactly mean by this ? `_hoodie_commit_time` is begin time and 
not completion time, right? Just for my understanding



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1880057543

   
   ## CI report:
   
   * d2c7c5cd379c4c8ba4297fd5fa1e989f73e58b5d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21854)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1880018357

   
   ## CI report:
   
   * b39e93881c130fd8832e5c260740f8f59bb8033e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21577)
 
   * d2c7c5cd379c4c8ba4297fd5fa1e989f73e58b5d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21854)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1880016530

   
   ## CI report:
   
   * b39e93881c130fd8832e5c260740f8f59bb8033e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21577)
 
   * d2c7c5cd379c4c8ba4297fd5fa1e989f73e58b5d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-04 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1442460176


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##
@@ -135,44 +127,37 @@ public static Builder builder() {
   public Result inputSplits(
   HoodieTableMetaClient metaClient,
   boolean cdcEnabled) {
-HoodieTimeline commitTimeline = getReadTimeline(metaClient);
-if (commitTimeline.empty()) {
-  LOG.warn("No splits found for the table under path " + path);
+
+IncrementalQueryAnalyzer analyzer = IncrementalQueryAnalyzer.builder()
+.metaClient(metaClient)
+.startTime(this.conf.getString(FlinkOptions.READ_START_COMMIT))
+.endTime(this.conf.getString(FlinkOptions.READ_END_COMMIT))

Review Comment:
   In streaming query, the user may declares nothing. And similar with Kafka, 
it consumes from the latest commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-04 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1442460318


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java:
##
@@ -516,7 +517,7 @@ public static boolean fileExists(FileSystem fs, Path path) {
   public static boolean isWriteCommit(HoodieTableType tableType, HoodieInstant 
instant, HoodieTimeline timeline) {
 return tableType == HoodieTableType.MERGE_ON_READ
 ? !instant.getAction().equals(HoodieTimeline.COMMIT_ACTION) // not a 
compaction
-: !ClusteringUtil.isClusteringInstant(instant, timeline);   // not a 
clustering
+: !ClusteringUtils.isClusteringInstant(instant, timeline);   // not a 
clustering

Review Comment:
   Will do it in another PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-04 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1442459532


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/TestIncrementalInputSplits.java:
##
@@ -72,67 +76,77 @@
 public class TestIncrementalInputSplits extends HoodieCommonTestHarness {
 
   @BeforeEach
-  void init() throws IOException {
+  void init() {
 initPath();
-initMetaClient();
   }
 
   @Test
-  void testFilterInstantsWithRange() {
-HoodieActiveTimeline timeline = new HoodieActiveTimeline(metaClient, true);
+  void testFilterInstantsWithRange() throws IOException {
 Configuration conf = TestConfigurations.getDefaultConf(basePath);
 conf.set(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true);
-IncrementalInputSplits iis = IncrementalInputSplits.builder()
-.conf(conf)
-.path(new Path(basePath))
-.rowType(TestConfigurations.ROW_TYPE)
-
.skipClustering(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING))
-.build();
+conf.set(FlinkOptions.TABLE_TYPE, FlinkOptions.TABLE_TYPE_MERGE_ON_READ);
+metaClient = HoodieTestUtils.init(basePath, HoodieTableType.MERGE_ON_READ);
 
+HoodieActiveTimeline timeline = metaClient.getActiveTimeline();
 HoodieInstant commit1 = new HoodieInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.DELTA_COMMIT_ACTION, "1");
 HoodieInstant commit2 = new HoodieInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.DELTA_COMMIT_ACTION, "2");
 HoodieInstant commit3 = new HoodieInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.DELTA_COMMIT_ACTION, "3");
 timeline.createCompleteInstant(commit1);
 timeline.createCompleteInstant(commit2);
 timeline.createCompleteInstant(commit3);
-timeline = timeline.reload();
+timeline = metaClient.reloadActiveTimeline();
+
+Map completionTimeMap = 
timeline.filterCompletedInstants().getInstantsAsStream()
+.collect(Collectors.toMap(HoodieInstant::getTimestamp, 
HoodieInstant::getCompletionTime));
 
+IncrementalQueryAnalyzer analyzer1 = IncrementalQueryAnalyzer.builder()
+.metaClient(metaClient)
+.rangeType(InstantRange.RangeType.OPEN_CLOSE)
+.startTime(completionTimeMap.get("1"))
+.skipClustering(true)
+.build();
 // previous read iteration read till instant time "1", next read iteration 
should return ["2", "3"]
-List instantRange2 = iis.filterInstantsWithRange(timeline, 
"1");
-assertEquals(2, instantRange2.size());
-assertIterableEquals(Arrays.asList(commit2, commit3), instantRange2);
+List activeInstants1 = 
analyzer1.analyze().getActiveInstants();
+assertEquals(2, activeInstants1.size());
+assertIterableEquals(Arrays.asList(commit2, commit3), activeInstants1);
 
 // simulate first iteration cycle with read from the LATEST commit
-List instantRange1 = iis.filterInstantsWithRange(timeline, 
null);
-assertEquals(1, instantRange1.size());
-assertIterableEquals(Collections.singletonList(commit3), instantRange1);
+IncrementalQueryAnalyzer analyzer2 = IncrementalQueryAnalyzer.builder()
+.metaClient(metaClient)
+.rangeType(InstantRange.RangeType.CLOSE_CLOSE)
+.skipClustering(true)
+.build();
+List activeInstants2 = 
analyzer2.analyze().getActiveInstants();
+assertEquals(1, activeInstants2.size());
+assertIterableEquals(Collections.singletonList(commit3), activeInstants2);
 
 // specifying a start and end commit
-conf.set(FlinkOptions.READ_START_COMMIT, "1");
-conf.set(FlinkOptions.READ_END_COMMIT, "3");
-List instantRange3 = iis.filterInstantsWithRange(timeline, 
null);
-assertEquals(3, instantRange3.size());
-assertIterableEquals(Arrays.asList(commit1, commit2, commit3), 
instantRange3);
+IncrementalQueryAnalyzer analyzer3 = IncrementalQueryAnalyzer.builder()

Review Comment:
   Should already been tested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-04 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1442458819


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-04 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1442458339


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';

Review Comment:
   Yeah, the doc can be improved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-04 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1442457312


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _

Review Comment:
   Archived instants included. From users standpoint, there is no notion for 
active or archived



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2024-01-01 Thread via GitHub


codope commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1439195459


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,428 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _

Review Comment:
   What is `earliest` here? Is it the earliest instant in the active, or is it 
including archived? Also, from the doc it is not clear why the instant 
filtering predicate is empty?



##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/TestIncrementalInputSplits.java:
##
@@ -72,67 +76,77 @@
 public class TestIncrementalInputSplits extends HoodieCommonTestHarness {
 
   @BeforeEach
-  void init() throws IOException {
+  void init() {
 initPath();
-initMetaClient();
   }
 
   @Test
-  void testFilterInstantsWithRange() {
-HoodieActiveTimeline timeline = new HoodieActiveTimeline(metaClient, true);
+  void testFilterInstantsWithRange() throws IOException {
 Configuration conf = TestConfigurations.getDefaultConf(basePath);
 conf.set(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true);
-IncrementalInputSplits iis = IncrementalInputSplits.builder()
-.conf(conf)
-.path(new Path(basePath))
-.rowType(TestConfigurations.ROW_TYPE)
-
.skipClustering(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING))
-.build();
+conf.set(FlinkOptions.TABLE_TYPE, FlinkOptions.TABLE_TYPE_MERGE_ON_READ);
+metaClient = HoodieTestUtils.init(basePath, HoodieTableType.MERGE_ON_READ);
 
+HoodieActiveTimeline timeline = metaClient.getActiveTimeline();
 HoodieInstant commit1 = new HoodieInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.DELTA_COMMIT_ACTION, "1");
 HoodieInstant commit2 = new HoodieInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.DELTA_COMMIT_ACTION, "2");
 HoodieInstant commit3 = new HoodieInstant(HoodieInstant.State.COMPLETED, 
HoodieTimeline.DELTA_COMMIT_ACTION, "3");
 timeline.createCompleteInstant(commit1);
 timeline.createCompleteInstant(commit2);
 timeline.createCompleteInstant(commit3);
-timeline = timeline.reload();
+timeline = metaClient.reloadActiveTimeline();
+
+Map completionTimeMap = 
timeline.filterCompletedInstants().getInstantsAsStream()
+.collect(Collectors.toMap(HoodieInstant::getTimestamp, 
HoodieInstant::getCompletionTime));
 
+IncrementalQueryAnalyzer analyzer1 = IncrementalQueryAnalyzer.builder()
+

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-18 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1862221288

   
   ## CI report:
   
   * b39e93881c130fd8832e5c260740f8f59bb8033e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21577)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-18 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1862087491

   
   ## CI report:
   
   * 3e12be6f351ee6da145de331af53c27c985b7d25 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21488)
 
   * b39e93881c130fd8832e5c260740f8f59bb8033e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21577)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-18 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1862081603

   
   ## CI report:
   
   * 3e12be6f351ee6da145de331af53c27c985b7d25 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21488)
 
   * b39e93881c130fd8832e5c260740f8f59bb8033e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1858800572

   > Overall it looks solid to me. Only concern is that this code is very 
critical and complex, do we have enough tests to ensure the correctness?
   
   We have all the tests for each use case in `TestInputFormat` and 
`TestIncrementalInputSplits`, they are currently Flink UTs, somehow we need to 
refactoring the tests into the `hudi-common`. (an engine agnostic data set 
validation).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428788123


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,

Review Comment:
   It could be, and that would induce empty dataset.



-- 
This is an automated message from the 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787998


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +187,111 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param rangeStart   The query range start completion time.
+   * @param rangeEnd The query range end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTimes(
+  HoodieTimeline readTimeline,
+  Option rangeStart,
+  Option rangeEnd,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.
-return getStartTimeSet(startCompletionTime, endCompletionTime, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+return getStartTimes(readTimeline, rangeStart, rangeEnd, rangeType, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
   }
 
   /**
* Queries the instant start time with given completion time range.
*
-   * @param startCompletionTime   The start completion time.
-   * @param endCompletionTime The end completion time.
-   * @param earliestStartTimeFunc The function to generate the earliest start 
time boundary
-   *  with the minimum completion time {@code 
startCompletionTime}.
+   * @param rangeStart  The query range start completion time.
+   * @param rangeEndThe query range end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime, Function earliestStartTimeFunc) {
-String startInstant = earliestStartTimeFunc.apply(startCompletionTime);
+  @VisibleForTesting
+  public List getStartTimes(
+  String rangeStart,
+  String rangeEnd,
+  Function earliestInstantTimeFunc) {
+return 
getStartTimes(metaClient.getCommitsTimeline().filterCompletedInstants(), 
Option.ofNullable(rangeStart), Option.ofNullable(rangeEnd),
+InstantRange.RangeType.CLOSE_CLOSE, earliestInstantTimeFunc);
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param readTimelineThe read timeline.
+   * @param rangeStart  The query range start completion time.
+   * @param rangeEndThe query range end completion time.
+   * @param rangeType   The range type.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  public List getStartTimes(

Review Comment:
   The completion time is user specified, the main purpose of this clazz is 
transfering the ocmpletion time range into instant time range.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787710


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787379


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787494


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428787116


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428786843


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -166,6 +169,28 @@ public boolean isInRange(String instant) {
 }
   }
 
+  /**
+   * Composition of multiple instant ranges in disjunctive form.

Review Comment:
   Yeah, the `conjunctive` and `disjunctive` is terminology for relation 
algebra(SQL), `disjunctive` usually means `OR`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-16 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428786517


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {

Review Comment:
   Kind of, we can rename to better one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1858435541

   Overall it looks solid to me. Only concern is that this code is very 
critical and complex, do we have enough tests to ensure the correctness?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428428388


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,

Review Comment:
   @danny0405 , is it possible that the entime is older than the last instant 
timestamp? 



-- 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428422995


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +187,111 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param rangeStart   The query range start completion time.
+   * @param rangeEnd The query range end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTimes(
+  HoodieTimeline readTimeline,
+  Option rangeStart,
+  Option rangeEnd,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.
-return getStartTimeSet(startCompletionTime, endCompletionTime, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+return getStartTimes(readTimeline, rangeStart, rangeEnd, rangeType, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
   }
 
   /**
* Queries the instant start time with given completion time range.
*
-   * @param startCompletionTime   The start completion time.
-   * @param endCompletionTime The end completion time.
-   * @param earliestStartTimeFunc The function to generate the earliest start 
time boundary
-   *  with the minimum completion time {@code 
startCompletionTime}.
+   * @param rangeStart  The query range start completion time.
+   * @param rangeEndThe query range end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime, Function earliestStartTimeFunc) {
-String startInstant = earliestStartTimeFunc.apply(startCompletionTime);
+  @VisibleForTesting
+  public List getStartTimes(
+  String rangeStart,
+  String rangeEnd,
+  Function earliestInstantTimeFunc) {
+return 
getStartTimes(metaClient.getCommitsTimeline().filterCompletedInstants(), 
Option.ofNullable(rangeStart), Option.ofNullable(rangeEnd),
+InstantRange.RangeType.CLOSE_CLOSE, earliestInstantTimeFunc);
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param readTimelineThe read timeline.
+   * @param rangeStart  The query range start completion time.
+   * @param rangeEndThe query range end completion time.
+   * @param rangeType   The range type.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  public List getStartTimes(

Review Comment:
   Is it possible for us to return the completion time directly since the range 
time and range end is completion time, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428422995


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +187,111 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param rangeStart   The query range start completion time.
+   * @param rangeEnd The query range end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTimes(
+  HoodieTimeline readTimeline,
+  Option rangeStart,
+  Option rangeEnd,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.
-return getStartTimeSet(startCompletionTime, endCompletionTime, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+return getStartTimes(readTimeline, rangeStart, rangeEnd, rangeType, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
   }
 
   /**
* Queries the instant start time with given completion time range.
*
-   * @param startCompletionTime   The start completion time.
-   * @param endCompletionTime The end completion time.
-   * @param earliestStartTimeFunc The function to generate the earliest start 
time boundary
-   *  with the minimum completion time {@code 
startCompletionTime}.
+   * @param rangeStart  The query range start completion time.
+   * @param rangeEndThe query range end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime, Function earliestStartTimeFunc) {
-String startInstant = earliestStartTimeFunc.apply(startCompletionTime);
+  @VisibleForTesting
+  public List getStartTimes(
+  String rangeStart,
+  String rangeEnd,
+  Function earliestInstantTimeFunc) {
+return 
getStartTimes(metaClient.getCommitsTimeline().filterCompletedInstants(), 
Option.ofNullable(rangeStart), Option.ofNullable(rangeEnd),
+InstantRange.RangeType.CLOSE_CLOSE, earliestInstantTimeFunc);
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param readTimelineThe read timeline.
+   * @param rangeStart  The query range start completion time.
+   * @param rangeEndThe query range end completion time.
+   * @param rangeType   The range type.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  public List getStartTimes(

Review Comment:
   Is it possible for us to return the completion time directly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428418974


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428383262


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428382099


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428371728


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428363554


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-15 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1428354941


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -166,6 +169,28 @@ public boolean isInRange(String instant) {
 }
   }
 
+  /**
+   * Composition of multiple instant ranges in disjunctive form.

Review Comment:
   Here by "disjuntive", do you mean the relation between these range is "OR", 
or do you mean these ranges are not consecutive? Meanwhile, in this function, 
we don't need to know their relation at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427516351


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +187,111 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param rangeStart   The query range start completion time.
+   * @param rangeEnd The query range end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTimes(
+  HoodieTimeline readTimeline,
+  Option rangeStart,
+  Option rangeEnd,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.

Review Comment:
   That would be rare case, we can make it configurable if needed, but 1 day is 
generally enough for most of the use cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427515924


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427515234


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,

Review Comment:
   Both our instant time/completion time are all formatted string in format: 
`MMddHHmmssSSS`.




Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427514359


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;

Review Comment:
   > instant == instant candidates
   
   yeah.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427422770


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {

Review Comment:
   the name of the class is a bit broader than what it does. To me, it is more 
like TimeLineAnalyzer something.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427414475


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +187,111 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param rangeStart   The query range start completion time.
+   * @param rangeEnd The query range end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTimes(
+  HoodieTimeline readTimeline,
+  Option rangeStart,
+  Option rangeEnd,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.

Review Comment:
   How does this assumption come from? For large data replication or sql 
queries, it can be more than one day.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427389589


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;

Review Comment:
   Nvm, i have figured it out from the following code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427389589


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;

Review Comment:
   Nvm, i have figured it out from the followding code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427384608


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427381908


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final Option startTime;
+  private final Option endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,

Review Comment:
   Why do we use "String" type for time? Shouldn't "Long"?



-- 
This is an automated message from 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427379166


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;

Review Comment:
   Question: What is the criterion to qualify for being active? Do we use a 
time threshold?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427374230


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;

Review Comment:
   instant == instant candidates? If yes, we should stick to use "instant". 
otherwise, please explain the difference between them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-14 Thread via GitHub


linliu-code commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1427374230


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;

Review Comment:
   instant == instant candidates?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1854029879

   
   ## CI report:
   
   * 3e12be6f351ee6da145de331af53c27c985b7d25 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21488)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1853676119

   
   ## CI report:
   
   * 3787083e09a0bf7d754d095b5826657efd4f468e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21395)
 
   * 3e12be6f351ee6da145de331af53c27c985b7d25 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21488)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1853664347

   
   ## CI report:
   
   * 3787083e09a0bf7d754d095b5826657efd4f468e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21395)
 
   * 3e12be6f351ee6da145de331af53c27c985b7d25 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1425055040


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +190,109 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param startTimeThe start completion time.
+   * @param endTime  The end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTime(
+  HoodieTimeline readTimeline,
+  @Nullable String startTime,
+  @Nullable String endTime,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.
-return getStartTimeSet(startCompletionTime, endCompletionTime, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+return getStartTime(readTimeline, startTime, endTime, rangeType, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param startTime   The start completion time.
+   * @param endTime The end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  @VisibleForTesting
+  public List getStartTime(
+  @Nullable String startTime,
+  @Nullable String endTime,
+  Function earliestInstantTimeFunc) {
+return 
getStartTime(metaClient.getCommitsTimeline().filterCompletedInstants(), 
startTime, endTime, InstantRange.RangeType.CLOSE_CLOSE, 
earliestInstantTimeFunc);
   }
 
   /**
* Queries the instant start time with given completion time range.
*
-   * @param startCompletionTime   The start completion time.
-   * @param endCompletionTime The end completion time.
-   * @param earliestStartTimeFunc The function to generate the earliest start 
time boundary
-   *  with the minimum completion time {@code 
startCompletionTime}.
+   * @param readTimelineThe read timeline.
+   * @param startTime   The start completion time.
+   * @param endTime The end completion time.
+   * @param rangeType   The range type.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime, Function earliestStartTimeFunc) {
-String startInstant = earliestStartTimeFunc.apply(startCompletionTime);
+  public List getStartTime(
+  HoodieTimeline readTimeline,
+  @Nullable String startTime,
+  @Nullable String endTime,
+  InstantRange.RangeType rangeType,
+  Function earliestInstantTimeFunc) {
+final boolean startFromEarliest = 
START_COMMIT_EARLIEST.equalsIgnoreCase(startTime);
+String earliestInstantToLoad = null;
+if (startTime != null && !startFromEarliest) {
+  earliestInstantToLoad = earliestInstantTimeFunc.apply(startTime);
+} else if (endTime != null) {
+  earliestInstantToLoad = earliestInstantTimeFunc.apply(endTime);
+}
+
+// ensure the earliest instant boundary be loaded.
+if (earliestInstantToLoad != null && 
HoodieTimeline.compareTimestamps(this.cursorInstant, GREATER_THAN, 
earliestInstantToLoad)) {
+  loadCompletionTimeIncrementally(earliestInstantToLoad);
+}
+
+if (startTime == null && endTime != null) {
+  // returns the last instant that finished at or before the given 
completion time 'endTime'.
+  String maxInstantTime = readTimeline.getInstantsAsStream()
+  .filter(instant -> instant.isCompleted() && 
HoodieTimeline.compareTimestamps(instant.getCompletionTime(), 
LESSER_THAN_OR_EQUALS, endTime))
+  
.max(Comparator.comparing(HoodieInstant::getCompletionTime)).map(HoodieInstant::getTimestamp).orElse(null);
+  if (maxInstantTime != null) {
+return Collections.singletonList(maxInstantTime);
+  }
+  // fallback to archived timeline
+  return this.startToCompletionInstantTimeMap.entrySet().stream()

Review Comment:
   We should, here we want to filter the timeline 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424806427


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +190,109 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param startTimeThe start completion time.
+   * @param endTime  The end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTime(
+  HoodieTimeline readTimeline,
+  @Nullable String startTime,
+  @Nullable String endTime,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.
-return getStartTimeSet(startCompletionTime, endCompletionTime, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+return getStartTime(readTimeline, startTime, endTime, rangeType, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param startTime   The start completion time.
+   * @param endTime The end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  @VisibleForTesting
+  public List getStartTime(

Review Comment:
   Currently we are using start instant time, i'm okay if we want to switch to 
'begin', have no preference for naming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1425037680


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,425 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final String startTime;
+  private final String endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) {
+this.metaClient = metaClient;
+

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1425028862


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -60,8 +63,8 @@ public String getEndInstant() {
   @Override
   public String toString() {
 return "InstantRange{"
-+ "startInstant='" + startInstant == null ? "null" : startInstant + 
'\''
-+ ", endInstant='" + endInstant == null ? "null" : endInstant + '\''
++ "startInstant='" + (startInstant == null ? "null" : startInstant) + 
'\''
++ ", endInstant='" + (endInstant == null ? "null" : endInstant) + '\''
 + ", rangeType='" + this.getClass().getSimpleName() + '\''

Review Comment:
   yeah, we do not hold the range type in the range object, so the class name 
makes sense, will refactor it when moving the package.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1425028862


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -60,8 +63,8 @@ public String getEndInstant() {
   @Override
   public String toString() {
 return "InstantRange{"
-+ "startInstant='" + startInstant == null ? "null" : startInstant + 
'\''
-+ ", endInstant='" + endInstant == null ? "null" : endInstant + '\''
++ "startInstant='" + (startInstant == null ? "null" : startInstant) + 
'\''
++ ", endInstant='" + (endInstant == null ? "null" : endInstant) + '\''
 + ", rangeType='" + this.getClass().getSimpleName() + '\''

Review Comment:
   yeah



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-13 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1425028379


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -22,9 +22,12 @@
 import org.apache.hudi.common.util.ValidationUtils;
 
 import java.io.Serializable;
+import java.util.Arrays;
 import java.util.Collections;
+import java.util.List;
 import java.util.Objects;
 import java.util.Set;
+import java.util.stream.Collectors;
 
 /**
  * A instant commits range used for incremental reader filtering.

Review Comment:
   Will address it in another PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424875356


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,425 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active
+ * '_hoodie_commit_time' = i_n
+ *   
+ *   
+ * [startTime, _]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed before or on 'endTime';
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ *   
+ * [earliest, endTime]
+ * i).find the instant set setA, setA is a collection of all the 
instants completed in the given time range;
+ * ii). read the latest snapshot from table metadata if setA has archived 
instants or the commit metadata if all the instants are still active
+ * '_hoodie_commit_time' in setA
+ *   
+ * 
+ *
+ *  A range type is required for analyzing the query so that the query 
range boundary inclusiveness have clear semantics.
+ *
+ * IMPORTANT: the reader may optionally choose to fall back to reading the 
latest snapshot if there are files missing from decoding the commit metadata.
+ */
+public class IncrementalQueryAnalyzer {
+  public static final String START_COMMIT_EARLIEST = "earliest";
+
+  private final HoodieTableMetaClient metaClient;
+  private final String startTime;
+  private final String endTime;
+  private final InstantRange.RangeType rangeType;
+  private final boolean skipCompaction;
+  private final boolean skipClustering;
+  private final int limit;
+
+  private IncrementalQueryAnalyzer(
+  HoodieTableMetaClient metaClient,
+  String startTime,
+  String endTime,
+  InstantRange.RangeType rangeType,
+  boolean skipCompaction,
+  boolean skipClustering,
+  int limit) {
+this.metaClient = metaClient;
+

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424874743


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,425 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]

Review Comment:
   We always assume `_` as the latest available commit, which is more matching 
to the streaming semantics, when both start and end time are `_`, just return 
the latest commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424872867


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +190,109 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param startTimeThe start completion time.
+   * @param endTime  The end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTime(
+  HoodieTimeline readTimeline,
+  @Nullable String startTime,
+  @Nullable String endTime,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.
-return getStartTimeSet(startCompletionTime, endCompletionTime, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+return getStartTime(readTimeline, startTime, endTime, rangeType, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param startTime   The start completion time.
+   * @param endTime The end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  @VisibleForTesting
+  public List getStartTime(
+  @Nullable String startTime,
+  @Nullable String endTime,
+  Function earliestInstantTimeFunc) {
+return 
getStartTime(metaClient.getCommitsTimeline().filterCompletedInstants(), 
startTime, endTime, InstantRange.RangeType.CLOSE_CLOSE, 
earliestInstantTimeFunc);
   }
 
   /**
* Queries the instant start time with given completion time range.
*
-   * @param startCompletionTime   The start completion time.
-   * @param endCompletionTime The end completion time.
-   * @param earliestStartTimeFunc The function to generate the earliest start 
time boundary
-   *  with the minimum completion time {@code 
startCompletionTime}.
+   * @param readTimelineThe read timeline.
+   * @param startTime   The start completion time.
+   * @param endTime The end completion time.
+   * @param rangeType   The range type.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime, Function earliestStartTimeFunc) {
-String startInstant = earliestStartTimeFunc.apply(startCompletionTime);
+  public List getStartTime(
+  HoodieTimeline readTimeline,
+  @Nullable String startTime,
+  @Nullable String endTime,
+  InstantRange.RangeType rangeType,
+  Function earliestInstantTimeFunc) {
+final boolean startFromEarliest = 
START_COMMIT_EARLIEST.equalsIgnoreCase(startTime);
+String earliestInstantToLoad = null;
+if (startTime != null && !startFromEarliest) {
+  earliestInstantToLoad = earliestInstantTimeFunc.apply(startTime);
+} else if (endTime != null) {
+  earliestInstantToLoad = earliestInstantTimeFunc.apply(endTime);
+}
+
+// ensure the earliest instant boundary be loaded.
+if (earliestInstantToLoad != null && 
HoodieTimeline.compareTimestamps(this.cursorInstant, GREATER_THAN, 
earliestInstantToLoad)) {
+  loadCompletionTimeIncrementally(earliestInstantToLoad);
+}
+
+if (startTime == null && endTime != null) {
+  // returns the last instant that finished at or before the given 
completion time 'endTime'.
+  String maxInstantTime = readTimeline.getInstantsAsStream()
+  .filter(instant -> instant.isCompleted() && 
HoodieTimeline.compareTimestamps(instant.getCompletionTime(), 
LESSER_THAN_OR_EQUALS, endTime))
+  
.max(Comparator.comparing(HoodieInstant::getCompletionTime)).map(HoodieInstant::getTimestamp).orElse(null);
+  if (maxInstantTime != null) {
+return Collections.singletonList(maxInstantTime);
+  }
+  // fallback to archived timeline
+  return this.startToCompletionInstantTimeMap.entrySet().stream()
+  .filter(entry -> 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424873530


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,425 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]
+ * The latest completed instant metadata
+ * '_hoodie_commit_time' = i_n, i_n is the latest completed 
instant
+ *   
+ *   
+ * [_, endTime]
+ * i).find the last completed instant i_n before or on 'endTim;
+ * ii). read the latest snapshot from table metadata if i_n is archived or 
the commit metadata if it is still active

Review Comment:
   +1, but for follow-up JIRA issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424806427


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +190,109 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param startTimeThe start completion time.
+   * @param endTime  The end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The instant time set.
+   * @return The sorted instant time list.
*/
-  public Set getStartTimeSet(String startCompletionTime, String 
endCompletionTime) {
+  public List getStartTime(
+  HoodieTimeline readTimeline,
+  @Nullable String startTime,
+  @Nullable String endTime,
+  InstantRange.RangeType rangeType) {
 // assumes any instant/transaction lasts at most 1 day to optimize the 
query efficiency.
-return getStartTimeSet(startCompletionTime, endCompletionTime, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+return getStartTime(readTimeline, startTime, endTime, rangeType, s -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(s, MILLI_SECONDS_IN_ONE_DAY));
+  }
+
+  /**
+   * Queries the instant start time with given completion time range.
+   *
+   * @param startTime   The start completion time.
+   * @param endTime The end completion time.
+   * @param earliestInstantTimeFunc The function to generate the earliest 
start time boundary
+   *with the minimum completion time.
+   *
+   * @return The sorted instant time list.
+   */
+  @VisibleForTesting
+  public List getStartTime(

Review Comment:
   Currently we are using start instant time, +1 to switch to 'begin'.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424806038


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +190,109 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.

Review Comment:
   The timeline can be optionally filtered with `skip_compaction` and 
`skip_clustering` flag. For example, when you wanna filter the clustering, if 
the replace commits is not filtered, the fs view would try to mask out the 
normal commits which make the commit invisible to reader.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1424804615


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java:
##
@@ -108,6 +108,9 @@ public static String instantTimePlusMillis(String 
timestamp, long milliseconds)
   public static String instantTimeMinusMillis(String timestamp, long 
milliseconds) {
 try {
   String timestampInMillis = fixInstantTimeCompatibility(timestamp);
+  if (timestampInMillis.length() < MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH) 
{

Review Comment:
   For testing compatibility.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-12 Thread via GitHub


vinothchandar commented on code in PR #10255:
URL: https://github.com/apache/hudi/pull/10255#discussion_r1420410913


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java:
##
@@ -22,9 +22,12 @@
 import org.apache.hudi.common.util.ValidationUtils;
 
 import java.io.Serializable;
+import java.util.Arrays;
 import java.util.Collections;
+import java.util.List;
 import java.util.Objects;
 import java.util.Set;
+import java.util.stream.Collectors;
 
 /**
  * A instant commits range used for incremental reader filtering.

Review Comment:
   Rename this class to base instant range? Also should this class should be in 
a different package? Like "timeline"?



##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java:
##
@@ -108,6 +108,9 @@ public static String instantTimePlusMillis(String 
timestamp, long milliseconds)
   public static String instantTimeMinusMillis(String timestamp, long 
milliseconds) {
 try {
   String timestampInMillis = fixInstantTimeCompatibility(timestamp);
+  if (timestampInMillis.length() < MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH) 
{

Review Comment:
   Can you help me understand whats going on here.



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/IncrementalQueryAnalyzer.java:
##
@@ -0,0 +1,425 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.read;
+
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.InstantRange;
+import org.apache.hudi.common.table.timeline.CompletionTimeQueryView;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.VisibleForTesting;
+import org.apache.hudi.common.util.collection.Pair;
+
+import javax.annotation.Nullable;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+/**
+ * Analyzer for incremental queries.
+ *
+ * The analyzer can supply info about the incremental queries including:
+ * 
+ *   The archived instant candidates;
+ *   The active instant candidates;
+ *   The instant filtering predicate, e.g the instant range;
+ *   Whether the query starts from the earliest;
+ *   Whether the query ends to the latest;
+ *   The max completion time used for fs view file slice version 
filtering.
+ * 
+ *
+ * Criteria for different query ranges:
+ *
+ * 
+ *   
+ * Query Range
+ * File Handles Decoding
+ * Instant Filtering Predicate
+ *   
+ *   
+ * [earliest, _]
+ * The latest snapshot files from table metadata
+ * _
+ *   
+ *   
+ * [earliest, endTime]
+ * The latest snapshot files from table metadata
+ * '_hoodie_commit_time' in setA, setA is a collection of all the 
instants completed before or on 'endTime'
+ *   
+ *   
+ * [_, _]

Review Comment:
   should we even allow this? seems like invalid input to Incremental query. 
Can we simplify to just . 
   
   [startTime, endTime]
   [startTime, _] (we assume _ is latest time)
   [_, endTime] (we assume _ is earliest time)
   ```
   
   



##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/CompletionTimeQueryView.java:
##
@@ -175,42 +190,109 @@ public Option getCompletionTime(String 
startTime) {
*
* By default, assumes there is at most 1 day time of duration for an 
instant to accelerate the queries.
*
-   * @param startCompletionTime The start completion time.
-   * @param endCompletionTime   The end completion time.
+   * @param readTimeline The read timeline.
+   * @param startTimeThe start completion time.
+   * @param endTime  The end completion time.
+   * @param rangeTypeThe range type.
*
-   * @return The 

Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-08 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1846746092

   
   ## CI report:
   
   * 3787083e09a0bf7d754d095b5826657efd4f468e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21395)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1846571904

   
   ## CI report:
   
   * bde657c2c091c062c342b0600c3269bbc15ebaaa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21364)
 
   * 3787083e09a0bf7d754d095b5826657efd4f468e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21395)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1846566776

   
   ## CI report:
   
   * bde657c2c091c062c342b0600c3269bbc15ebaaa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21364)
 
   * 3787083e09a0bf7d754d095b5826657efd4f468e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-07 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1845772369

   
   ## CI report:
   
   * bde657c2c091c062c342b0600c3269bbc15ebaaa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21364)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1844707886

   
   ## CI report:
   
   * 07b8dd347e0157ed848cddd4d5bb3b229d42db5a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21338)
 
   * bde657c2c091c062c342b0600c3269bbc15ebaaa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21364)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1844688303

   
   ## CI report:
   
   * 07b8dd347e0157ed848cddd4d5bb3b229d42db5a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21338)
 
   * bde657c2c091c062c342b0600c3269bbc15ebaaa UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1843084912

   
   ## CI report:
   
   * 07b8dd347e0157ed848cddd4d5bb3b229d42db5a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21338)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1842679930

   
   ## CI report:
   
   * 07b8dd347e0157ed848cddd4d5bb3b229d42db5a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21338)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7184] Add IncrementalQueryAnalyzer for completion time based in… [hudi]

2023-12-06 Thread via GitHub


hudi-bot commented on PR #10255:
URL: https://github.com/apache/hudi/pull/10255#issuecomment-1842668247

   
   ## CI report:
   
   * 07b8dd347e0157ed848cddd4d5bb3b229d42db5a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org