Re: [PR] Add Temporal Merge Policy for time-series data [lucene]

via GitHub Wed, 11 Feb 2026 10:00:25 -0800


churromorales commented on code in PR #15620:
URL: https://github.com/apache/lucene/pull/15620#discussion_r2794696399



##########
lucene/core/src/java/org/apache/lucene/index/TemporalMergePolicy.java:
##########
@@ -0,0 +1,980 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.TreeMap;
+import java.util.concurrent.TimeUnit;
+import java.util.logging.Logger;
+import org.apache.lucene.codecs.PointsFormat;
+import org.apache.lucene.codecs.PointsReader;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.IOContext;
+
+/**
+ * A merge policy that groups segments by time windows and merges segments within the same window.
+ * This policy is designed for time-series data where documents contain a timestamp field indexed as
+ * a {@link org.apache.lucene.document.LongPoint}.
+ *
+ * <p>This policy organizes segments into time buckets based on the maximum timestamp in each
+ * segment. Recent data goes into small time windows (e.g., 1 hour), while older data is grouped
+ * into exponentially larger windows (e.g., 4 hours, 16 hours, etc.). Segments within the same time
+ * window are merged together when they meet the configured thresholds, but segments from different
+ * time windows are never merged together, preserving temporal locality.
+ *
+ * <p><b>When to use this policy:</b>
+ *
+ * <ul>
+ *   <li>Time-series data where queries typically filter by time ranges
+ *   <li>Data with a timestamp field that can be used for bucketing
+ *   <li>Workloads where older data is queried less frequently than recent data
+ *   <li>Use cases where you want to avoid mixing old and new data in the same segment
+ * </ul>
+ *
+ * <p><b>Configuration:</b>
+ *
+ * <pre class="prettyprint">
+ * TemporalMergePolicy policy = new TemporalMergePolicy()
+ *     .setTemporalField("timestamp")           // Required: name of the timestamp field
+ *     .setBaseTimeSeconds(3600)                // Base window size: 1 hour
+ *     .setMinThreshold(4)                      // Merge when 4+ segments in a window
+ *     .setMaxThreshold(8)                      // Merge at most 8 segments at once
+ *     .setCompactionRatio(1.2)                 // Size ratio threshold for merging
+ *     .setUseExponentialBuckets(true);         // Use exponentially growing windows
+ *
+ * IndexWriterConfig config = new IndexWriterConfig(analyzer);
+ * config.setMergePolicy(policy);
+ * </pre>
+ *
+ * <p><b>Time bucketing:</b> When {@link #setUseExponentialBuckets} is true (default), window sizes
+ * grow exponentially: {@code baseTime}, {@code baseTime * minThreshold}, {@code baseTime *
+ * minThreshold^2}, etc. This ensures that recent data is in small, frequently-merged windows while
+ * older data is in larger, less-frequently-merged windows. When false, all windows have the same
+ * size ({@code baseTime}).
+ *
+ * <p><b>Compaction ratio:</b> The {@link #setCompactionRatio} parameter controls when merges are
+ * triggered. A merge is considered when the total document count across candidate segments exceeds
+ * {@code largestSegment * compactionRatio}. Lower values (e.g., 1.2) trigger merges more
+ * aggressively, while higher values (e.g., 2.0) allow more segments to accumulate before merging.
+ * Set to 1.0 for most aggressive merging.
+ *
+ * <p><b>NOTE:</b> This policy requires a timestamp field indexed as a {@link
+ * org.apache.lucene.document.LongPoint}. The timestamp can be in seconds, milliseconds, or
+ * microseconds (auto-detected based on value magnitude).
+ *
+ * <p><b>NOTE:</b> Segments from different time windows are never merged together, even during
+ * {@link IndexWriter#forceMerge(int)}. If you call {@code forceMerge(1)} but have segments in
+ * multiple time windows, you will end up with one segment per time window.
+ *
+ * <p><b>NOTE:</b> Very old segments (older than {@link #setMaxAgeSeconds}) are not merged to avoid
+ * unnecessary I/O on cold data.
+ *
+ * @lucene.experimental
+ */
+public class TemporalMergePolicy extends MergePolicy {
+
+  private static final Logger log = Logger.getLogger(TemporalMergePolicy.class.getName());
+
+  // Configuration parameters
+  private String temporalField = "";

Review Comment:
   Good question. Looking at `TieredMergePolicy`, the default segments per tier is 8, which I tried to follow; it seems to balance merge efficiency and I/O cost. As for `minThreshold`, I assumed the default would be exponential bucketing, so `4` seems to make the most sense because the windows do not grow too slowly. If I chose something lower, like `2`, it would take a long time to reach multi-day windows; if I chose something higher, like `8`, we would lose temporal granularity too quickly. In all honesty the defaults could be almost anything; I tried to base them on the defaults for `TieredMergePolicy`.
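
To make the trade-off between `2`, `4`, and `8` concrete, here is a minimal sketch of the window growth described in the Javadoc (`baseTime * minThreshold^level`). It is not code from the PR: the class `WindowGrowth` and both method names are hypothetical, and it assumes the growth factor equals `minThreshold` and that levels are counted from 0.

```java
public class WindowGrowth {

  /** Window size in seconds at a given level: baseTimeSeconds * factor^level. */
  static long windowSizeSeconds(long baseTimeSeconds, int factor, int level) {
    long size = baseTimeSeconds;
    for (int i = 0; i < level; i++) {
      // multiplyExact throws ArithmeticException instead of silently overflowing
      size = Math.multiplyExact(size, (long) factor);
    }
    return size;
  }

  /** Smallest level whose window is at least targetSeconds long. */
  static int levelsToReach(long baseTimeSeconds, int factor, long targetSeconds) {
    if (baseTimeSeconds <= 0 || factor < 2) {
      // A factor below 2 would never grow, so the loop below would not terminate
      throw new IllegalArgumentException("baseTimeSeconds must be > 0 and factor >= 2");
    }
    int level = 0;
    while (windowSizeSeconds(baseTimeSeconds, factor, level) < targetSeconds) {
      level++;
    }
    return level;
  }

  public static void main(String[] args) {
    long baseTimeSeconds = 3600; // 1 hour, as in the Javadoc example
    long oneDaySeconds = 86_400;
    for (int factor : new int[] {2, 4, 8}) {
      int level = levelsToReach(baseTimeSeconds, factor, oneDaySeconds);
      System.out.println(
          "factor=" + factor
              + " level=" + level
              + " windowSeconds=" + windowSizeSeconds(baseTimeSeconds, factor, level));
    }
  }
}
```

With a 1-hour base, the first window of at least one day appears at level 5 for a factor of `2`, level 3 for `4`, and level 2 for `8`, which matches the intuition above: `2` climbs slowly, while `8` jumps from 8-hour windows straight to 64-hour windows.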



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]