bhabegger opened a new pull request, #2793:
URL: https://github.com/apache/jackrabbit-oak/pull/2793
# LuceneNg (Lucene 9.12.2) Implementation - Phase 1 & 2
This PR implements a new Lucene 9 index module (`oak-search-luceneNg`) for
Apache Jackrabbit Oak, targeting the OAK-12089 epic.
## Goals of This PR
| Goal | Description | Status |
|------|-------------|--------|
| **New Module** | Create `oak-search-luceneNg` module with Lucene 9.12.2 | ✅ Complete |
| **Write Path** | Implement document indexing via IndexEditor | ✅ Complete |
| **Storage** | Oak-native storage with chunked blob support | ✅ Complete |
| **Read Path** | Implement query execution with full-text search | ✅ Complete |
| **Property Queries** | Support equality constraints on indexed properties | ✅ Complete |
| **Full-Text Search** | Support analyzed text queries with tokenization | ✅ Complete |
| **Test Coverage** | Comprehensive unit and integration tests | ✅ Complete (53/53 tests pass) |
| **Build Integration** | Maven build, OSGi bundles, Apache RAT compliance | ✅ Complete |
## Implementation Status & Roadmap
### ✅ Phase 1: Write Path (Complete)
| Component | Status | Tests | Notes |
|-----------|--------|-------|-------|
| **LuceneNgIndexEditor** | ✅ Done | 7 tests | Indexes string properties (single & multi-value) |
| **OakDirectory** | ✅ Done | 16 tests | Lucene Directory backed by Oak NodeStore |
| **Chunked I/O** | ✅ Done | 5 tests | Efficient large file handling with 1MB chunks |
| **IndexWriter lifecycle** | ✅ Done | 7 tests | Shared writer pattern for correct commit semantics |
### ✅ Phase 2: Read Path - Basic Queries (Complete)
| Component | Status | Tests | Notes |
|-----------|--------|-------|-------|
| **LuceneNgIndex** | ✅ Done | 2 tests | QueryIndex implementation with cost calculation |
| **Full-text queries** | ✅ Done | 2 tests | Visitor pattern, tokenization, phrase/term queries |
| **Property queries** | ✅ Done | 5 tests | Exact-match equality constraints |
| **LuceneNgCursor** | ✅ Done | 7 tests | Result iteration with score support |
| **Query planner integration** | ✅ Done | 2 tests | Cost-based index selection (cost = 2.0) |
### Phase 2: Read Path - Advanced (Planned)
| Feature | Priority | Complexity | Notes |
|---------|----------|------------|-------|
| **Range queries** | High | Medium | Support `<`, `>`, `<=`, `>=` operators |
| **Boolean queries** | High | Medium | Complex AND/OR/NOT combinations |
| **Sorting** | Medium | Medium | ORDER BY support |
| **Aggregation rules** | Medium | High | Property aggregation across node types |
| **Highlighting** | Low | Medium | rep:excerpt support |
| **Faceting** | Low | High | rep:facet support |
### Phase 3: Migration & Production (Future)
| Feature | Priority | Complexity | Notes |
|---------|----------|------------|-------|
| **Hot migration** | High | High | Migrate from Lucene 4.7 without downtime |
| **Index compatibility** | High | High | Read existing Lucene indexes |
| **Performance benchmarks** | High | Medium | Compare with legacy Lucene |
| **AEM integration testing** | High | High | Validate in AEM environment |
| **Documentation** | Medium | Low | Usage guides, migration docs |
## What's Included
### New Module Structure
```
oak-search-luceneNg/
├── src/main/java/
│   └── org/apache/jackrabbit/oak/plugins/index/luceneNg/
│       ├── LuceneNgIndex.java              # Query execution
│       ├── LuceneNgIndexEditor.java        # Document indexing
│       ├── LuceneNgCursor.java             # Result iteration
│       ├── LuceneNgIndexTracker.java       # Index lifecycle
│       ├── LuceneNgIndexDefinition.java    # Index metadata
│       ├── IndexSearcherHolder.java        # Search resource management
│       └── directory/
│           ├── OakDirectory.java           # Lucene Directory implementation
│           ├── OakIndexInput.java          # Read operations
│           └── OakIndexOutput.java         # Write operations
└── src/test/java/
    ├── LuceneNgComparisonTest.java         # Property query validation
    ├── IntegrationTest.java                # End-to-end tests
    ├── IndexingFunctionalTest.java         # Indexing edge cases
    └── directory/                          # Storage layer tests
```
### Key Features
**Query Support:**
- ✅ Full-text search with StandardAnalyzer tokenization
- ✅ Property equality queries (`@property = 'value'`)
- ✅ Cost-based query planning
- ✅ Score-based result ranking

**Indexing:**
- ✅ String properties (single and multi-value)
- ✅ Full-text aggregation to `:fulltext` field
- ✅ Exact-match fields for property queries
- ✅ Handling for Lucene's 32 KB term length limit

**Storage:**
- ✅ Oak NodeStore integration via `:data` child node
- ✅ Chunked blob storage (1MB chunks)
- ✅ Concurrent read/write support
- ✅ Memory-efficient streaming
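
To illustrate the chunked storage idea, here is a minimal, self-contained sketch of splitting file content into 1MB chunks and reading a byte back by position. It is not the code from `OakDirectory`; the class `ChunkSketch` and its methods are hypothetical names chosen for this example, and the real implementation stores chunks as Oak blobs rather than in a Java list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of this PR. */
final class ChunkSketch {
    /** 1 MB, matching the chunk size stated in this description. */
    static final int CHUNK_SIZE = 1024 * 1024;

    /** Splits data into chunks of at most chunkSize bytes; the last chunk may be shorter. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        int offset = 0;
        while (offset < data.length) {
            int length = Math.min(chunkSize, data.length - offset);
            chunks.add(Arrays.copyOfRange(data, offset, offset + length));
            offset += length;
        }
        return chunks;
    }

    /** Reads one byte at an absolute position by locating its chunk, without joining the chunks. */
    static byte readByte(List<byte[]> chunks, int chunkSize, long position) {
        int chunkIndex = (int) (position / chunkSize);
        int offsetInChunk = (int) (position % chunkSize);
        return chunks.get(chunkIndex)[offsetInChunk];
    }
}
```

A random read only touches the chunk that contains the requested position, which is the property that keeps large index files from having to be loaded as a whole.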
## Test Results
**All 53 tests pass:**
- ✅ 16 OakDirectory tests (storage layer)
- ✅ 7 IndexingFunctionalTest (write path)
- ✅ 5 LuceneNgComparisonTest (property queries)
- ✅ 5 IntegrationTest (end-to-end)
- ✅ 20 additional unit tests (components, tracking, etc.)

**Build:**
```
mvn clean install
[INFO] Tests run: 53, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```
## Technical Highlights
### 1. Proper Full-Text Query Building
Implements a visitor pattern that matches the legacy Lucene behavior:
- Tokenizes query text using StandardAnalyzer
- Builds a PhraseQuery for terms that produce multiple tokens
- Handles FullTextAnd, FullTextOr, FullTextTerm expressions
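
The visitor idea can be sketched without any Lucene dependency. The example below is a simplified stand-in, not the PR's code: all types nested in `FullTextSketch` are hypothetical, `tokenize` is only a rough approximation of analyzer tokenization (lower-casing and splitting on non-alphanumerics), and the visitor produces a query string instead of Lucene `Query` objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical types for illustration only; not Oak's or Lucene's classes. */
final class FullTextSketch {
    interface Visitor<T> {
        T visitTerm(Term term);
        T visitAnd(And and);
        T visitOr(Or or);
    }

    interface Expr {
        <T> T accept(Visitor<T> visitor);
    }

    static final class Term implements Expr {
        final String text;
        Term(String text) { this.text = text; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitTerm(this); }
    }

    static final class And implements Expr {
        final List<Expr> parts;
        And(Expr... parts) { this.parts = Arrays.asList(parts); }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitAnd(this); }
    }

    static final class Or implements Expr {
        final List<Expr> parts;
        Or(Expr... parts) { this.parts = Arrays.asList(parts); }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitOr(this); }
    }

    /** Rough stand-in for analyzer tokenization: lower-case, split on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Builds a query string: one token becomes a term, several tokens become a phrase. */
    static final class QueryStringBuilder implements Visitor<String> {
        private final String field;
        QueryStringBuilder(String field) { this.field = field; }

        public String visitTerm(Term term) {
            List<String> tokens = tokenize(term.text);
            if (tokens.size() == 1) {
                return field + ":" + tokens.get(0);
            }
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }

        public String visitAnd(And and) {
            // Every part is required.
            return and.parts.stream()
                    .map(part -> "+(" + part.accept(this) + ")")
                    .collect(Collectors.joining(" "));
        }

        public String visitOr(Or or) {
            // Any part may match.
            return or.parts.stream()
                    .map(part -> "(" + part.accept(this) + ")")
                    .collect(Collectors.joining(" "));
        }
    }
}
```

Keeping the tree walk in a visitor means the expression classes stay free of query-building logic, so a different target representation only needs a different visitor.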
### 2. Shared IndexWriter Pattern
The root editor creates the IndexWriter and child editors share it:
- Prevents data loss from multiple writers
- Correct commit semantics across node tree
- Proper resource cleanup
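
As a rough sketch of the shared-writer idea (not the actual `LuceneNgIndexEditor`), the example below uses a hypothetical `Writer` stand-in that buffers documents until commit. Child editors receive the root's writer instead of creating their own, and only the root commits when it is left. The names `SharedWriterSketch`, `Editor`, `index` and `leave` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical types for illustration only; not the PR's editor or Lucene's IndexWriter. */
final class SharedWriterSketch {
    /** Stand-in for an index writer: documents are buffered until commit(). */
    static final class Writer {
        final List<String> pending = new ArrayList<>();
        final List<String> committed = new ArrayList<>();
        int commitCount = 0;

        void add(String document) { pending.add(document); }

        void commit() {
            committed.addAll(pending);
            pending.clear();
            commitCount++;
        }
    }

    static final class Editor {
        private final Writer writer;
        private final boolean root;
        private final String path;

        private Editor(Writer writer, boolean root, String path) {
            this.writer = writer;
            this.root = root;
            this.path = path;
        }

        /** The root editor is the only place where a writer is created. */
        static Editor root() { return new Editor(new Writer(), true, "/"); }

        /** Child editors reuse the root's writer rather than opening a second one. */
        Editor child(String name) {
            String childPath = path.equals("/") ? "/" + name : path + "/" + name;
            return new Editor(writer, false, childPath);
        }

        /** Each editor adds a document for its own node to the shared writer. */
        void index() { writer.add(path); }

        /** Only the root commits, so the whole tree lands in a single commit. */
        void leave() {
            if (root) {
                writer.commit();
            }
        }

        Writer writer() { return writer; }
    }
}
```

With this shape, leaving a child editor has no visible effect on the index, and everything added during the traversal becomes visible together when the root is left.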
### 3. Dynamic NodeBuilder Access
Avoids staleness issues during commits:
```java
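// Resolve the directory builder on every access instead of caching it,
// so each call goes through the current definition builder.
private NodeBuilder getDirectoryBuilder() {
    return definitionBuilder.child(INDEX_DATA_CHILD_NAME);
}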
```
### 4. Field Strategy
- **StringField**: Exact matching for property queries (not analyzed)
- **TextField**: Analyzed text for full-text search (FieldNames.FULLTEXT)
- **Path storage**: Stored field for cursor results
## Related Issues
- **OAK-12089**: Epic for Lucene 9 migration
- Builds on exploration work from earlier branches
## Notes for Reviewers
1. **Module isolation**: New module doesn't affect existing lucene/elastic
modules
2. **Dependency embedding**: Lucene 9.12.2 libs embedded to avoid conflicts
3. **Test independence**: All tests use in-memory storage, no external
dependencies
4. **Apache compliance**: All files have Apache license headers, RAT check
passes
## ✅ Checklist
- [x] All tests pass
- [x] Apache RAT license check passes
- [x] Code follows Oak patterns (QueryIndex, IndexEditor, Cursor)
- [x] No backwards compatibility issues (new module, opt-in)
- [x] Documentation in code comments
- [x] Test coverage for all major code paths
---
**Ready for review!** This PR establishes the foundation for Lucene 9
support in Oak. Phase 2 advanced features and Phase 3 migration can be tackled
in subsequent PRs.