atris opened a new issue, #14997:
URL: https://github.com/apache/lucene/issues/14997
### Description
**Summary**
This RFC proposes implementing SPANN (Space Partition Approximate Nearest
Neighbor) as an additional vector format for Lucene. SPANN provides an
alternative approach to vector search that excels in different scenarios than
our current HNSW implementation, particularly for memory-constrained
environments and filtered search workloads.
SPANN uses spatial partitioning instead of graph structures, enabling
different performance characteristics and resource usage patterns. This
provides users with choice in vector search implementations based on their
specific requirements.
**Problem Statement**
Production vector search deployments often face constraints that current
implementations struggle to address:
Memory scalability: Large vector indices require substantial memory for
optimal performance, limiting deployment options and increasing infrastructure
costs.
Filtered search efficiency: When combining vector similarity with document
filters, current approaches examine vectors that may not match filter criteria.
Merge resource usage: Segment merge operations for vector data can require
significant temporary memory allocation.
Hardware flexibility: Different deployment environments have varying
memory, CPU, and storage characteristics that could benefit from different
algorithmic approaches.
**SPANN Solution Overview**
SPANN addresses these challenges through spatial partitioning. Vectors are
clustered into spatial regions during indexing, with only lightweight partition
metadata kept in memory. Search operations filter partitions based on spatial
proximity before examining individual vectors.
This approach provides:
- Bounded memory usage independent of index size
- Efficient filtered search through partition pre-filtering
- Linear merge complexity with predictable resource usage
- Sequential I/O patterns optimized for modern storage
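The partition-then-scan flow described above can be sketched end to end. All class and method names here are illustrative placeholders, not the proposed Lucene API:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

// Toy sketch of SPANN's two-phase search: pick the partitions whose
// centroids are nearest to the query, then brute-force only those vectors.
public class SpannSketch {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Phase 1: rank partitions by centroid distance, keep the closest nProbe
    static int[] selectPartitions(float[] query, float[][] centroids, int nProbe) {
        Integer[] ids = new Integer[centroids.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, Comparator.comparingDouble(i -> squaredDistance(query, centroids[i])));
        int[] out = new int[Math.min(nProbe, ids.length)];
        for (int i = 0; i < out.length; i++) out[i] = ids[i];
        return out;
    }

    // Phase 2: scan only the vectors assigned to the selected partitions
    static int nearestDoc(float[] query, float[][] vectors, int[] assignments, int[] partitions) {
        Set<Integer> probe = new HashSet<>();
        for (int p : partitions) probe.add(p);
        int best = -1;
        float bestDist = Float.MAX_VALUE;
        for (int doc = 0; doc < vectors.length; doc++) {
            if (!probe.contains(assignments[doc])) continue; // partition pre-filter
            float d = squaredDistance(query, vectors[doc]);
            if (d < bestDist) { bestDist = d; best = doc; }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] centroids = {{0f, 0f}, {10f, 10f}};
        float[][] vectors = {{0.1f, 0f}, {9.9f, 10f}, {0f, 0.2f}};
        int[] assignments = {0, 1, 0};
        float[] query = {0.05f, 0.05f};
        int[] probe = selectPartitions(query, centroids, 1);
        System.out.println(probe[0]);                                       // 0
        System.out.println(nearestDoc(query, vectors, assignments, probe)); // 0
    }
}
```

With two well-separated clusters, probing only the nearest partition still finds the true nearest neighbor while skipping the other partition's vectors entirely.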
**Technical Architecture**
**Core Algorithm Details**
Partitioning Algorithm: SPANN uses k-means++ initialization followed by
Lloyd's iteration for partition creation. The k-means++ initialization provides
better initial centroid placement than random selection, leading to higher
quality
partitions:
```java
// K-means++ initialization
float[][] selectInitialCentroids(float[][] vectors, int k) {
    float[][] centroids = new float[k][];
    centroids[0] = vectors[random.nextInt(vectors.length)];
    for (int i = 1; i < k; i++) {
        float[] distances = computeMinDistances(vectors, centroids, i);
        float[] probabilities = computeProbabilities(distances);
        centroids[i] = selectWithProbability(vectors, probabilities);
    }
    return centroids;
}

// Lloyd's iteration for convergence; returns the refined centroids,
// since reassigning the parameter is not visible to the caller
float[][] refinePartitions(float[][] vectors, float[][] centroids, int maxIters) {
    for (int iter = 0; iter < maxIters; iter++) {
        int[] assignments = assignToNearestCentroid(vectors, centroids);
        float[][] newCentroids = recomputeCentroids(vectors, assignments);
        if (hasConverged(centroids, newCentroids, CONVERGENCE_THRESHOLD)) {
            break;
        }
        centroids = newCentroids;
    }
    return centroids;
}
```
Partition Count Selection: The number of partitions is determined
dynamically based on vector count and dimensionality:
```java
int computeOptimalPartitionCount(int numVectors, int dimensions) {
    // Target ~10K vectors per partition for optimal I/O
    int basePartitions = (int) Math.ceil(numVectors / 10000.0);
    // Adjust for high dimensions, which need smaller partitions
    double dimensionFactor = Math.sqrt(dimensions / 128.0);
    int adjustedPartitions = (int) (basePartitions * dimensionFactor);
    // Bounds to ensure reasonable partition counts
    return Math.max(10, Math.min(10000, adjustedPartitions));
}
```
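For instance, plugging representative sizes into the formula above (restated here so the example is self-contained):

```java
public class PartitionCountExample {
    // Same formula as the proposal's computeOptimalPartitionCount, repeated
    // verbatim so this example compiles on its own.
    static int computeOptimalPartitionCount(int numVectors, int dimensions) {
        int basePartitions = (int) Math.ceil(numVectors / 10000.0);
        double dimensionFactor = Math.sqrt(dimensions / 128.0);
        int adjustedPartitions = (int) (basePartitions * dimensionFactor);
        return Math.max(10, Math.min(10000, adjustedPartitions));
    }

    public static void main(String[] args) {
        // 1M vectors at 768 dims: 100 base partitions, scaled by sqrt(6) ≈ 2.45
        System.out.println(computeOptimalPartitionCount(1_000_000, 768)); // 244
        // Small index: the lower bound of 10 partitions kicks in
        System.out.println(computeOptimalPartitionCount(5_000, 128)); // 10
    }
}
```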
Partition Selection During Search: The number of partitions to examine is
determined by the search k value and partition quality:
```java
int computePartitionsToExamine(int k, int totalPartitions, float[] queryVector,
                               float[][] centroids) {
    // Base expansion factor for approximate search
    double expansionFactor = 1.5 + Math.log10(k);
    int baseExamine = (int) Math.ceil(k * expansionFactor);
    // Adjust based on the query's distance to its two nearest centroids
    float nearestDist = Float.MAX_VALUE;
    float secondDist = Float.MAX_VALUE;
    for (float[] centroid : centroids) {
        float dist = computeDistance(queryVector, centroid);
        if (dist < nearestDist) {
            secondDist = nearestDist;
            nearestDist = dist;
        } else if (dist < secondDist) {
            secondDist = dist;
        }
    }
    // If the query is very close to one centroid, examine fewer partitions
    float ratio = nearestDist / secondDist;
    if (ratio < 0.5) {
        baseExamine = (int) (baseExamine * 0.7);
    }
    // Never examine more than 10% of partitions, but always at least one
    // (without the floor, fewer than 10 total partitions would yield zero)
    return Math.max(1, Math.min(baseExamine, totalPartitions / 10));
}
```
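Worked through for typical values, ignoring the centroid-ratio adjustment (i.e. assuming the query is not unusually close to any single centroid), the base expansion step behaves as follows; the one-partition floor is an assumption added here for tiny indices:

```java
public class ExamineCountExample {
    // The base expansion step of the partition-count heuristic above,
    // without the centroid-ratio adjustment.
    static int baseExamine(int k, int totalPartitions) {
        double expansionFactor = 1.5 + Math.log10(k);
        int baseExamine = (int) Math.ceil(k * expansionFactor);
        // 10% cap, with an assumed floor of one partition
        return Math.max(1, Math.min(baseExamine, totalPartitions / 10));
    }

    public static void main(String[] args) {
        // k=10: factor = 1.5 + 1 = 2.5, so 25 partitions, within the 10% cap
        System.out.println(baseExamine(10, 1000)); // 25
        // k=100: factor = 3.5 -> 350, but capped at 10% of 1000 partitions
        System.out.println(baseExamine(100, 1000)); // 100
    }
}
```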
**File Format Specifications**
Metadata File (.spm):
```
Header:
- Magic: int32 (0x53504D30) // "SPM0"
- Version: int32
- CodecHeader: standard Lucene codec header
Partition Metadata:
- NumPartitions: vint
- Dimensions: vint
- For each partition:
  - PartitionId: vint
  - NumVectors: vint
  - ByteOffset: vlong (offset in .spd file)
  - ByteLength: vlong (compressed size if applicable)
  - Centroid: float[dimensions]
  - RadiusSquared: float (max squared distance from centroid)
Footer:
- CodecFooter: standard Lucene checksum footer
```
Vector Data File (.spd):
```
Header:
- Magic: int32 (0x53504430) // "SPD0"
- Version: int32
- CodecHeader: standard Lucene codec header
For each partition (in order):
- PartitionHeader:
  - NumVectors: vint
  - CompressionType: byte (0=none, 1=bulk)
- For each vector in partition:
  - DocId: vint (delta-encoded within partition)
  - Vector: float[dimensions] or byte[compressedSize]
Footer:
- CodecFooter: standard Lucene checksum footer
```
Index File (.spi):
```
Header:
- Magic: int32 (0x53504930) // "SPI0"
- Version: int32
- CodecHeader: standard Lucene codec header
Document Index:
- NumDocs: vint
- For each document (in docId order):
  - PartitionId: vint
  - IntraPartitionOrd: vint
Auxiliary Structures:
- DeletedDocs: RoaringBitmap (if any deletes)
- FieldInfo: standard Lucene field metadata
Footer:
- CodecFooter: standard Lucene checksum footer
```
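As a sanity check on the header layout, the three magic constants decode to their ASCII tags under big-endian byte order. This sketch is illustrative only and does not use Lucene's DataOutput:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MagicExample {
    static final int SPM_MAGIC = 0x53504D30; // "SPM0"
    static final int SPD_MAGIC = 0x53504430; // "SPD0"
    static final int SPI_MAGIC = 0x53504930; // "SPI0"

    // Decode a big-endian int32 magic back into its four-character ASCII tag
    static String tag(int magic) {
        return new String(ByteBuffer.allocate(4).putInt(magic).array(),
                          StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(tag(SPM_MAGIC)); // SPM0
        System.out.println(tag(SPD_MAGIC)); // SPD0
        System.out.println(tag(SPI_MAGIC)); // SPI0
    }
}
```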
**Memory Layout and Access Patterns**
In-Memory Structures:
```java
class SpannSearcher {
    // Primary memory-resident data
    final float[][] partitionCentroids;   // k * dimensions * 4 bytes
    final float[] partitionRadii;         // k * 4 bytes
    final long[] partitionOffsets;        // k * 8 bytes
    final int[] partitionSizes;           // k * 4 bytes

    // Derived data for optimization
    final float[] centroidNorms;          // k * 4 bytes (for dot product)
    final PartitionStats[] stats;         // k * 32 bytes (min/max/mean)

    // Total memory: ~k * (dimensions * 4 + 52) bytes
    // For 1000 partitions, 768 dimensions: ~3.1MB
}
```
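The ~3.1MB figure can be checked by summing the listed per-partition fields (52 bytes of scalars plus the centroid floats); object headers and array alignment are ignored here:

```java
public class MemoryEstimateExample {
    // Rough per-partition bytes for the memory-resident structures above:
    // centroid floats, radius, offset, size, centroid norm, stats record.
    static long estimateBytes(int partitions, int dimensions) {
        long perPartition = (long) dimensions * 4  // centroid
                + 4    // radius
                + 8    // offset
                + 4    // size
                + 4    // centroid norm
                + 32;  // stats (min/max/mean)
        return partitions * perPartition;
    }

    public static void main(String[] args) {
        // 1000 partitions, 768 dimensions: ~3.1MB, matching the estimate above
        System.out.println(estimateBytes(1000, 768)); // 3124000
    }
}
```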
Sequential Access Optimization:
```java
void searchPartition(int partitionId, float[] query, KnnCollector collector) {
    long offset = partitionOffsets[partitionId];
    int size = partitionSizes[partitionId];
    // Bulk read the entire partition for sequential access
    ByteBuffer buffer = ByteBuffer.allocateDirect(size * vectorBytes);
    vectorData.seek(offset);
    vectorData.readBytes(buffer);
    // Process vectors with CPU cache-friendly access
    for (int i = 0; i < size; i++) {
        int docId = readVInt(buffer);
        float[] vector = readVector(buffer);
        // Skip collection when the score cannot improve current results
        float score = computeSimilarity(query, vector);
        if (score > collector.minCompetitiveSimilarity()) {
            collector.collect(docId, score);
        }
    }
}
```
**Merge Strategy Implementation Details**
Partition Quality Metrics:
```java
class PartitionQuality {
    float computeQuality(PartitionInfo partition) {
        // Cohesion: how close vectors are to the centroid
        float avgDistToCentroid = partition.sumDistances / partition.numVectors;
        float normalizedCohesion = 1.0f / (1.0f + avgDistToCentroid);

        // Balance: how close the partition is to the target size
        // (cast avoids integer division; clamp keeps oversized partitions at 0)
        float targetSize = (float) totalVectors / numPartitions;
        float sizeRatio = partition.numVectors / targetSize;
        float balance = Math.max(0f, 1.0f - Math.abs(1.0f - sizeRatio));

        // Separation: how far this partition is from its nearest neighbor
        float minDistToOther = Float.MAX_VALUE;
        for (PartitionInfo other : partitions) {
            if (other != partition) {
                float dist = computeDistance(partition.centroid, other.centroid);
                minDistToOther = Math.min(minDistToOther, dist);
            }
        }
        float separation = minDistToOther / globalMaxDistance;

        // Weighted quality score
        return 0.4f * normalizedCohesion + 0.3f * balance + 0.3f * separation;
    }
}
```
Adaptive Merge Decision Logic:
```java
MergeStrategy selectMergeStrategy(MergeState mergeState) {
    float changeRatio = (float) mergeState.newVectorCount / mergeState.existingVectorCount;
    float avgQuality = computeAveragePartitionQuality();
    if (changeRatio < 0.05f && avgQuality > 0.8f) {
        // Tiny merge with good partitions: preserve structure
        return MergeStrategy.PRESERVE_PARTITIONS;
    } else if (changeRatio < 0.2f && avgQuality > 0.6f) {
        // Small merge: rebalance worst partitions only
        return MergeStrategy.SELECTIVE_REBALANCE;
    } else if (changeRatio < 0.5f) {
        // Medium merge: incremental repartitioning
        return MergeStrategy.INCREMENTAL_REPARTITION;
    } else {
        // Large merge: full repartitioning for optimal structure
        return MergeStrategy.FULL_REPARTITION;
    }
}
```
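A few representative inputs show how the thresholds interact (self-contained restatement of the decision table; enum names copied from the snippet above):

```java
public class MergeStrategyExample {
    enum MergeStrategy {
        PRESERVE_PARTITIONS, SELECTIVE_REBALANCE,
        INCREMENTAL_REPARTITION, FULL_REPARTITION
    }

    // Same thresholds as the adaptive merge logic above
    static MergeStrategy select(float changeRatio, float avgQuality) {
        if (changeRatio < 0.05f && avgQuality > 0.8f) return MergeStrategy.PRESERVE_PARTITIONS;
        if (changeRatio < 0.2f && avgQuality > 0.6f) return MergeStrategy.SELECTIVE_REBALANCE;
        if (changeRatio < 0.5f) return MergeStrategy.INCREMENTAL_REPARTITION;
        return MergeStrategy.FULL_REPARTITION;
    }

    public static void main(String[] args) {
        // Appending 1% of new vectors to high-quality partitions: keep structure
        System.out.println(select(0.01f, 0.9f)); // PRESERVE_PARTITIONS
        // Doubling the index: rebuild from scratch
        System.out.println(select(1.0f, 0.9f)); // FULL_REPARTITION
        // A tiny merge over low-quality partitions falls through to
        // INCREMENTAL_REPARTITION, because quality gates the first two branches
        System.out.println(select(0.01f, 0.5f)); // INCREMENTAL_REPARTITION
    }
}
```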
**Search Algorithm Implementation**
Multi-Level Partition Filtering:
```java
List<Integer> selectPartitionsToSearch(float[] query, int k, SearchParams params) {
    // Phase 1: compute all centroid distances
    PriorityQueue<PartitionScore> pq =
        new PriorityQueue<>(Comparator.comparingDouble((PartitionScore p) -> p.distance));
    for (int i = 0; i < numPartitions; i++) {
        float distance = computeDistance(query, partitionCentroids[i]);
        pq.offer(new PartitionScore(i, distance));
    }
    // Phase 2: select partitions using an adaptive threshold
    List<Integer> selected = new ArrayList<>();
    float minDistance = pq.peek().distance;
    float threshold = minDistance * params.getExpansionFactor(k);
    while (!pq.isEmpty() && selected.size() < maxPartitionsToExamine) {
        PartitionScore ps = pq.poll();
        if (ps.distance > threshold && selected.size() >= k) {
            break; // have enough partitions
        }
        selected.add(ps.partitionId);
    }
    // Phase 3: add neighbor partitions for boundary cases
    if (params.enableBoundarySearch) {
        addNeighborPartitions(selected, query);
    }
    return selected;
}
```
Optimized Within-Partition Search:
```java
void searchPartitionOptimized(int partitionId, float[] query, KnnCollector collector,
                              Bits acceptDocs) throws IOException {
    // Memory-mapped access for large partitions
    if (partitionSizes[partitionId] > MMAP_THRESHOLD) {
        searchPartitionMapped(partitionId, query, collector, acceptDocs);
        return;
    }
    // Bulk load for smaller partitions
    VectorReader reader = loadPartition(partitionId);
    // SIMD-friendly processing loop
    int vecSize = reader.size();
    int simdWidth = SimdUtils.getWidth();
    // Process vectors in SIMD-width batches
    for (int i = 0; i < vecSize; i += simdWidth) {
        int batchSize = Math.min(simdWidth, vecSize - i);
        float[] scores = new float[batchSize];
        // Vectorized similarity computation
        SimdUtils.computeSimilarityBatch(query, reader, i, batchSize, scores);
        // Collect results
        for (int j = 0; j < batchSize; j++) {
            int docId = reader.getDocId(i + j);
            if (acceptDocs == null || acceptDocs.get(docId)) {
                collector.collect(docId, scores[j]);
            }
        }
    }
}
```
**Configuration and Tuning Parameters**
```java
public class SpannConfig {
    // Partitioning parameters
    int minPartitionSize = 1000;            // Minimum vectors per partition
    int maxPartitionSize = 50000;           // Maximum vectors per partition
    int targetVectorsPerPartition = 10000;  // Optimal partition size

    // Clustering parameters
    int kmeansMaxIterations = 50;           // Max iterations for convergence
    double kmeansConvergence = 0.001;       // Convergence threshold
    int kmeansSampleSize = 100000;          // Sample size for large datasets

    // Search parameters
    double baseExpansionFactor = 1.5;       // Base partition expansion
    double expansionGrowthRate = 0.1;       // Growth per log10(k)
    int maxPartitionFraction = 10;          // At most 1/10th of partitions examined

    // Performance parameters
    int mmapThreshold = 1048576;            // Use mmap for partitions > 1MB
    int bulkReadSize = 65536;               // Bulk I/O size in bytes
    boolean enableSimd = true;              // Use SIMD optimizations

    // Merge parameters
    float qualityThreshold = 0.6f;          // Min quality to preserve partitions
    float rebalanceThreshold = 0.8f;        // Quality threshold for rebalancing
}
```
**Design Considerations and Trade-offs**
**Algorithmic Design Decisions**
K-means++ vs Random Initialization: K-means++ requires more computation
during initialization but produces significantly more stable partitions. The
extra initialization cost (one pass over the vectors per centroid chosen, i.e.
O(n * k * d)) is worthwhile, since poor initial centroids
Partition Count Formula: The target of ~10K vectors per partition balances
several factors: smaller partitions enable more selective filtering but
increase metadata overhead, while larger partitions reduce overhead but force
examination
of more irrelevant vectors. The dimension-based adjustment accounts for
the "curse of dimensionality" where high-dimensional spaces require more
partitions to maintain effective separation.
Quality Metric Weighting: The partition quality metric balances cohesion
(0.4), balance (0.3), and separation (0.3). These weights reflect that cohesion
is most critical for search accuracy, while balance and separation prevent
pathological cases like single-vector partitions or overlapping clusters.
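A quick numeric check of the weighting (inputs here are hypothetical component scores, not measured values):

```java
public class QualityWeightExample {
    // Weighted quality score with the weights discussed above
    static float quality(float cohesion, float balance, float separation) {
        return 0.4f * cohesion + 0.3f * balance + 0.3f * separation;
    }

    public static void main(String[] args) {
        // A tight, slightly oversized, moderately separated partition: ≈ 0.81
        System.out.println(quality(0.9f, 0.8f, 0.7f));
        // Cohesion dominates: an equal 0.4 drop in cohesion costs 0.16,
        // while the same drop in balance costs only 0.12
        System.out.println(quality(0.5f, 0.8f, 0.7f) < quality(0.9f, 0.4f, 0.7f)); // true
    }
}
```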
**Memory vs. Accuracy Trade-offs**
Partition Examination Strategy: The expansion factor (1.5 + log₁₀(k))
grows slowly with k to balance recall quality against computational cost.
Linear growth would examine too many partitions for large k values, while
constant expansion
would miss relevant results for small k queries.
Boundary Handling: The optional neighbor partition addition addresses the
boundary problem where optimal results might lie near partition edges. This
adds computational overhead but can significantly improve recall in edge cases.
**Implementation Complexity Considerations**
File Format Evolution: The three-file format enables independent
optimization of different data access patterns while maintaining backward
compatibility. Metadata and index files are typically small and can be
memory-resident, while
vector data can be streamed or memory-mapped based on access patterns.
Merge Strategy Selection: The adaptive merge logic reflects different
cost/benefit ratios at various scales. Small merges benefit from structure
preservation to avoid repartitioning overhead, while large merges benefit from
global
optimization despite the computational cost.
**Performance Characteristics**
**Complexity Analysis**
Indexing Complexity:
- Vector assignment: O(n * k * d) where n=vectors, k=partitions,
d=dimensions
- K-means clustering: O(k * n * d * iterations)
- Total flush complexity: O(n * k * d * log(n/k)) including sorting
Search Complexity:
- Partition selection: O(P * d) centroid distance computations, where P = total partitions
- Within-partition search: O(p * n_p * d) where p = partitions examined, n_p = vectors per partition
- Total search: O(P * d + p * n_p * d), typically much less than HNSW's O(log(n) * d * M)
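Plugging illustrative numbers into the search terms above (1000 total partitions, 768 dimensions, 25 partitions examined with ~10K vectors each) shows where the cost concentrates:

```java
public class SearchCostExample {
    // Centroid distance computations during partition selection
    static long selectionOps(long totalPartitions, long dims) {
        return totalPartitions * dims;
    }

    // Distance computations while scanning the selected partitions
    static long scanOps(long partitionsExamined, long vectorsPerPartition, long dims) {
        return partitionsExamined * vectorsPerPartition * dims;
    }

    public static void main(String[] args) {
        System.out.println(selectionOps(1000, 768));       // 768000
        System.out.println(scanOps(25, 10_000, 768));      // 192000000
        // Selection cost is negligible next to the partition scans, so the
        // number of partitions examined is the main recall/latency knob.
    }
}
```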
Merge Complexity:
- Preserve partitions: O(n) simple copying
- Selective rebalance: O(n_affected * k * d) for affected partitions only
- Full repartition: O(n * k * d * iterations) same as initial indexing
**Memory Usage Patterns**
Indexing memory (during flush):
- Vector buffer: n * d * 4 bytes (temporary)
- Clustering workspace: k * d * 4 + k * 8 bytes
- Total: O(n * d) temporary, O(k * d) permanent

Search memory:
- Permanent:
  - Centroids: k * d * 4 bytes
  - Metadata: k * 24 bytes
  - Total: ~3MB for a typical configuration
- Per-query:
  - Priority queue: k * 12 bytes
  - Partition buffer: p * n_p * d * 4 bytes (can be streamed)
**Detailed Implementation Phases**
Phase 1: Foundation Implementation
Core Format Infrastructure:
- Implement Lucene95SpannVectorsFormat class hierarchy
- Basic three-file format with standard Lucene headers/footers
- Simple k-means partitioning algorithm for initial clustering
- Integration with SegmentWriteState and SegmentReadState
Basic Writer Implementation:
- Lucene95SpannVectorsWriter with vector accumulation and flush-time
partitioning
- Partition assignment algorithms and storage layout optimization
- Integration with KnnFieldVectorsWriter interface
- Memory usage tracking via Accountable interface
Basic Reader Implementation:
- Lucene95SpannVectorsReader with partition metadata loading
- Simple sequential search within selected partitions
- Integration with existing KnnCollector framework
- Standard integrity checking and validation
Testing Infrastructure:
- Unit tests for partitioning algorithms and file format
- Integration tests with existing Lucene test framework
- Basic performance validation against simple datasets
- Compatibility testing with compound file format
Deliverables: Working SPANN implementation with basic functionality,
passing all compatibility tests.
Phase 2: Search Optimization
Partition Selection Algorithms:
- Sophisticated partition ranking based on centroid distances and sizes
- Early termination logic to avoid examining distant partitions
- Integration with query-time parameters (k, accuracy requirements)
- Support for different distance metrics and similarity functions
Access Pattern Optimization:
- Random access implementation for sparse partition selection
- Sequential streaming for dense partition examination
- Automatic selection between access patterns based on query
characteristics
- Memory-mapped I/O integration for large partition handling
Within-Partition Optimizations:
- Vectorized distance computations where supported
- Cache-friendly data layout and processing patterns
- SIMD utilization hints for modern CPU architectures
- Prefetching strategies for predictable access patterns
Search API Integration:
- Full compatibility with existing search APIs and parameters
- Integration with document filtering (Bits interface)
- Support for both float and byte vector encodings
- Proper handling of deleted documents and live docs
Deliverables: Optimized search implementation competitive with existing
formats on standard benchmarks.
Phase 3: Advanced Merge Implementation
Adaptive Merge Strategies:
- Algorithms for determining when to preserve vs. rebuild partition
structures
- Partition quality metrics for rebalancing decisions
- Integration with segment size predictions and merge policy hints
- Handling of edge cases (empty segments, single-vector segments)
Merge Performance Optimization:
- Bulk copying optimizations for preserved partitions
- Efficient repartitioning algorithms for structure updates
- Memory usage minimization during merge operations
- Progress reporting and cancellation support
Advanced Partitioning Algorithms:
- Beyond basic k-means: hierarchical clustering, density-based approaches
- Partition size balancing and split/merge logic
- Handling of outlier vectors and sparse regions
- Incremental partition updates for small merges
Integration with Merge Framework:
- Proper handling of MergeState and document mapping
- Integration with merge schedulers and policies
- Support for merge-time document sorting and field updates
- Compatibility with merge abortion and error recovery
Deliverables: Production-ready merge implementation with superior
performance characteristics.
Phase 4: Production Hardening
Memory Management Refinement:
- Precise memory accounting and reporting
- Integration with Lucene's memory pressure monitoring
- Configurable memory limits and adaptive behavior
- Garbage collection optimization and reduced allocation
I/O Performance Optimization:
- Advanced memory-mapped I/O strategies
- Async I/O support where beneficial
- Integration with Lucene's directory abstraction optimizations
- Proper handling of different storage types (SSD, HDD, network)
Configuration and Tuning:
- Exposed parameters for partition count, size targets, algorithms
- Auto-tuning based on vector characteristics and hardware
- Performance monitoring and diagnostic capabilities
- Integration with existing Lucene monitoring infrastructure
Error Handling and Recovery:
- Comprehensive error handling for corruption scenarios
- Recovery procedures for partially written segments
- Integration with Lucene's checksum and integrity verification
- Proper resource cleanup and exception safety
Deliverables: Robust, production-ready implementation with comprehensive
error handling and monitoring.
Phase 5: Integration and Validation
Comprehensive Testing:
- Large-scale testing with realistic datasets and query patterns
- Performance benchmarking across different hardware configurations
- Memory usage validation under various load patterns
- Compatibility testing with all Lucene features (compound files,
checksums, etc.)
Migration and Compatibility:
- Tools for migrating from existing vector formats
- Mixed-format index support during transition periods
- Backward compatibility testing with existing applications
- Documentation for migration procedures and best practices
Performance Analysis:
- Detailed benchmarking against existing implementations
- Analysis of performance characteristics across different scenarios
- Memory usage profiling and optimization validation
- I/O pattern analysis and storage optimization verification
Documentation and Tooling:
- Complete API documentation and usage examples
- Performance tuning guides and best practices
- Administrative tools for index analysis and optimization
- Integration with existing Lucene tooling ecosystem
Deliverables: Fully validated implementation ready for production
deployment with complete documentation.
Phase 6: Community Integration
Code Review and Refinement:
- Community review of implementation and design decisions
- Code quality improvements and consistency with Lucene standards
- Performance validation by community members
- Integration feedback and final adjustments
Documentation Finalization:
- User documentation and migration guides
- Developer documentation for extensibility
- Performance characteristics and tuning recommendations
- Troubleshooting guides and common issues
Release Preparation:
- Final testing and validation
- Release notes and changelog preparation
- Coordination with Lucene release process
- Community communication and adoption planning
Deliverables: Implementation ready for inclusion in Lucene release with
full community validation.
**Open Technical Questions**
Algorithm Variations:
- Alternative clustering algorithms (mini-batch k-means, hierarchical
clustering) may provide different trade-offs between partitioning quality and
computational cost
- Dynamic partition count adjustment based on distribution characteristics
could optimize for different vector datasets
- Learned partition boundaries using ML techniques might improve upon
traditional clustering methods
Optimization Opportunities:
- Product quantization within partitions could provide compression
benefits while maintaining search quality
- Hierarchical partitioning for very large indices might enable better
scalability than flat partitioning
Integration Considerations:
- Interaction with Lucene's emerging vector quantization work needs
evaluation for compatibility and optimization opportunities
- Support for incremental indexing patterns beyond the current
segment-based approach
- Integration with distributed search scenarios and cross-segment
optimization opportunities
Validation Requirements:
- Comprehensive benchmarking across diverse vector datasets to validate
algorithmic assumptions
- Memory usage validation under realistic production workloads and various
JVM configurations
- Performance comparison methodology development for fair evaluation
against existing implementations
**Conclusion**
SPANN provides a production-ready alternative to current vector search
implementations, with clear technical specifications for core algorithms and
data structures. The implementation leverages Lucene's architectural strengths
while providing predictable performance characteristics and resource usage
patterns.