churromorales commented on code in PR #15620:
URL: https://github.com/apache/lucene/pull/15620#discussion_r2794696399
##########
lucene/core/src/java/org/apache/lucene/index/TemporalMergePolicy.java:
##########
@@ -0,0 +1,980 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.TreeMap;
+import java.util.concurrent.TimeUnit;
+import java.util.logging.Logger;
+import org.apache.lucene.codecs.PointsFormat;
+import org.apache.lucene.codecs.PointsReader;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+
+/**
+ * A merge policy that groups segments by time windows and merges segments within the same window,
+ * This policy is designed for time-series data where documents contain a timestamp field indexed as
+ * a {@link org.apache.lucene.document.LongPoint}.
+ *
+ * <p>This policy organizes segments into time buckets based on the maximum timestamp in each
+ * segment. Recent data goes into small time windows (e.g., 1 hour), while older data is grouped
+ * into exponentially larger windows (e.g., 4 hours, 16 hours, etc.). Segments within the same time
+ * window are merged together when they meet the configured thresholds, but segments from different
+ * time windows are never merged together, preserving temporal locality.
+ *
+ * <p><b>When to use this policy:</b>
+ *
+ * <ul>
+ *   <li>Time-series data where queries typically filter by time ranges
+ *   <li>Data with a timestamp field that can be used for bucketing
+ *   <li>Workloads where older data is queried less frequently than recent data
+ *   <li>Use cases where you want to avoid mixing old and new data in the same segment
+ * </ul>
+ *
+ * <p><b>Configuration:</b>
+ *
+ * <pre class="prettyprint">
+ * TemporalMergePolicy policy = new TemporalMergePolicy()
+ *     .setTemporalField("timestamp") // Required: name of the timestamp field
+ *     .setBaseTimeSeconds(3600) // Base window size: 1 hour
+ *     .setMinThreshold(4) // Merge when 4+ segments in a window
+ *     .setMaxThreshold(8) // Merge at most 8 segments at once
+ *     .setCompactionRatio(1.2) // Size ratio threshold for merging
+ *     .setUseExponentialBuckets(true); // Use exponentially growing windows
+ *
+ * IndexWriterConfig config = new IndexWriterConfig(analyzer);
+ * config.setMergePolicy(policy);
+ * </pre>
+ *
+ * <p><b>Time bucketing:</b> When {@link #setUseExponentialBuckets} is true (default), window sizes
+ * grow exponentially: {@code baseTime}, {@code baseTime * minThreshold}, {@code baseTime *
+ * minThreshold^2}, etc. This ensures that recent data is in small, frequently-merged windows while
+ * older data is in larger, less-frequently-merged windows. When false, all windows have the same
+ * size ({@code baseTime}).
+ *
+ * <p><b>Compaction ratio:</b> The {@link #setCompactionRatio} parameter controls when merges are
+ * triggered. A merge is considered when the total document count across candidate segments exceeds
+ * {@code largestSegment * compactionRatio}. Lower values (e.g., 1.2) trigger merges more
+ * aggressively, while higher values (e.g., 2.0) allow more segments to accumulate before merging.
+ * Set to 1.0 for most aggressive merging.
+ *
+ * <p><b>NOTE:</b> This policy requires a timestamp field indexed as a {@link
+ * org.apache.lucene.document.LongPoint}. The timestamp can be in seconds, milliseconds, or
+ * microseconds (auto-detected based on value magnitude).
+ *
+ * <p><b>NOTE:</b> Segments from different time windows are never merged together, even during
+ * {@link IndexWriter#forceMerge(int)}. If you call {@code forceMerge(1)} but have segments in
+ * multiple time windows, you will end up with one segment per time window.
+ *
+ * <p><b>NOTE:</b> Very old segments (older than {@link #setMaxAgeSeconds}) are not merged to avoid
+ * unnecessary I/O on cold data.
+ *
+ * @lucene.experimental
+ */
+public class TemporalMergePolicy extends MergePolicy {
+
+  private static final Logger log = Logger.getLogger(TemporalMergePolicy.class.getName());
+
+  // Configuration parameters
+  private String temporalField = "";

Review Comment:
   Good question. Looking at `TieredMergePolicy`, the default segments per tier is 8, which I tried to follow; that seems to balance merge efficiency and I/O cost. As for `minThreshold`, I assume the default will be exponential bucketing, so `4` seems to make the most sense: the windows do not grow too slowly. With something lower, like `2`, it would take a long time to reach multi-day windows; with something higher, like `8`, we would lose temporal granularity too quickly. In all honesty it could be anything. I was thinking about reasonable default values and tried to base them on the defaults for `TieredMergePolicy`.
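To make the `minThreshold` trade-off concrete, here is a minimal sketch, assuming window sizes follow `baseTime * minThreshold^level` as the quoted Javadoc describes. The class `WindowGrowthSketch` and its methods are hypothetical illustrations and are not part of the PR; the sketch only does the arithmetic and does not touch any Lucene API.

```java
public class WindowGrowthSketch {

  /** Window size at a given level, assumed to be baseSeconds * factor^level. */
  public static long windowSeconds(long baseSeconds, int factor, int level) {
    long size = baseSeconds;
    for (int i = 0; i < level; i++) {
      // multiplyExact throws ArithmeticException instead of silently overflowing
      size = Math.multiplyExact(size, (long) factor);
    }
    return size;
  }

  /** Smallest level whose window is at least targetSeconds long. */
  public static int levelsToReach(long baseSeconds, int factor, long targetSeconds) {
    if (baseSeconds <= 0 || factor < 2) {
      throw new IllegalArgumentException("baseSeconds must be > 0 and factor must be >= 2");
    }
    int level = 0;
    while (windowSeconds(baseSeconds, factor, level) < targetSeconds) {
      level++;
    }
    return level;
  }

  public static void main(String[] args) {
    long baseSeconds = 3600; // 1-hour base window, as in the Javadoc example
    long oneDaySeconds = 86_400;
    for (int factor : new int[] {2, 4, 8}) {
      int level = levelsToReach(baseSeconds, factor, oneDaySeconds);
      long hours = windowSeconds(baseSeconds, factor, level) / 3600;
      System.out.printf(
          "minThreshold=%d: first window >= 1 day is level %d (%d hours)%n",
          factor, level, hours);
    }
  }
}
```

With a 1-hour base, a window of at least one day is first reached at level 5 for a factor of `2`, level 3 for `4`, and level 2 for `8`, which matches the intuition that `2` grows slowly and `8` coarsens quickly.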
