Very cool! I remember wanting something like this when I was looking into making Lucene a bit better at ingesting small structured documents, like the ones indexed by IndexGeoNames (https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/extra/perf/IndexGeoNames.java). Is your POC complete enough to get a sense of the speedup that we'd get on this benchmark?

On Tue, Apr 28, 2026 at 3:53 AM Tim Brooks <[email protected]> wrote:

> Hi all,
>
> I'd like to propose adding a column-oriented document-ingestion API to
> IndexWriter and get early feedback on the shape before opening a PR. I've
> been prototyping this on a branch and would like to understand community
> appetite before pushing further.
>
> https://github.com/apache/lucene/pull/15990
>
> ## The concept
>
> Today IndexWriter consumes an Iterable<IndexableField> per document: the
> indexing chain walks each field, re-resolves FieldInfo / PerField state,
> revalidates the field type against the schema, and interleaves
> stored-fields, postings, doc-values and points per document.
>
> The proposal is to add a parallel intake path:
> IndexWriter.addBatch(ColumnBatch). A ColumnBatch exposes a set of Columns,
> where each Column represents one field across all documents in the batch.
> The indexing chain then processes the batch in two passes:
>
> 1. A row-oriented pass for stored fields and the inverted index (per-doc
> processing still matters there).
> 2. A column-oriented pass for doc values, vectors, and points (where
> per-field bulk writes are a natural fit).
>
> Column itself is just metadata (name, IndexableFieldType, density).
> Iteration happens through typed cursors obtained from the subclasses:
> LongColumn for numeric DV, 1-D points, and numeric stored; BinaryColumn for
> binary/sorted DV, text/binary stored, and binary-encoded points; and
> VectorColumn for KNN vectors. Each cursor call returns a fresh cursor, so a
> column can be traversed once in the row pass and again in the column pass.
>
> ## Two benefits motivate this:
>
> 1. More compact in-memory representation during indexing. A column batch
> avoids the per-field allocations of the document-at-a-time path
> (IndexableField instances, per-doc FieldType references, per-doc attribute
> maps).
> For numeric DV and points in particular, the caller can hand us a
> primitive-backed cursor that the chain drains directly into
> PackedLongValues / the points writer without indirection.
> 2. Less redundant field validation. Field name, type, indexing options,
> and schema compatibility are resolved once per column instead of once per
> IndexableField. For workloads where a caller already knows the schema of a
> batch, that revalidation is pure overhead.
>
> All in all, these changes reduce the CPU time spent in
> IndexWriter#addDocuments by 4-5x for analytics-heavy workloads.
>
> No changes to on-disk format; this is an ingestion-side API only.
>
> ## MVP: sparse columns
>
> The minimum useful version is sparse-only: every column is allowed to skip
> doc-ids or have multiple values per doc-id, and the chain goes through the
> same per-doc paths it uses today (just driven by a cursor instead of an
> IndexableField stream). This is enough to land the API, the two-pass
> consumer, and the public addBatch entry point without touching the
> doc-values / points writers.
>
> ## Follow-on option: dense columns
>
> The bigger performance wins come from advertising a column as dense, i.e.
> every doc in [0, numDocs) has exactly one value. That lets the chain:
>
> - Skip the sparse-bitset bookkeeping in NumericDocValuesWriter /
> SortedNumericDocValuesWriter entirely on the dense path.
> - Bulk-fill straight into PackedLongValues from the column's values()
> cursor, avoiding the per-value add loop.
> - For 1-D numeric points, feed the BKD writer from the same dense
> primitive cursor instead of one BytesRef at a time.
> - For n-D numeric points, a fixed-size binary column could feed multiple
> document points in a single write. This is an expert scenario, as users
> have to serialize the points properly in sort order in the column.
>
> Density is asserted by the column up-front so the chain can pick the path
> without probing.
>
> ## Follow-on option: Ergonomic builders
>
> I have focused on very low-level APIs (abstract long and byte columns
> implemented by users). Lucene could eventually add builders to make
> columns easier to create (similar to IntField, LongField, etc).
>
> ## Indexed-only terms ("DOCS + no norms") as a column
>
> One more case worth flagging: fields indexed with IndexOptions.DOCS and no
> norms (keyword/filter-style fields) don't need per-doc TokenStream
> plumbing. A BinaryColumn over such a field can feed the postings writer
> directly (one BytesRef per doc, no analysis, no norm accumulation). I have
> not implemented this in my POC.
>
> ## Scope of the initial proposal
>
> - New package org.apache.lucene.document.column with ColumnBatch, Column,
> LongColumn, BinaryColumn, and their cursors.
> - New IndexWriter.addBatch(ColumnBatch) returning a seqno, plumbed through
> DocumentsWriter / DocumentsWriterPerThread.
> - Indexing-chain changes to support the two-pass consumer.
> - All marked @lucene.experimental.
> - Try to implement as much of the column-oriented processing as possible
> in the column package, to keep things experimental for as long as possible.
>
> Would love feedback on whether this is something Lucene is interested in
> or would be open to. It would help significantly in the analytical case and
> remove significant indirection and memory usage amplification on the
> per-field allocations.
>
> Thanks,
> Tim
>
> --
> Tim Brooks
> [email protected]

--
Adrien
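
To make the two-pass idea in the quoted proposal concrete, here is a minimal, self-contained sketch of the sparse (MVP) shape: a column that hands out a fresh cursor on every call, a row pass that walks all columns in doc-id order, and a column pass that drains one column at a time. Everything in it is hypothetical. The type and method names (ColumnBatchSketch, LongCursor, LongColumn, arrayColumn, rowPass, columnPass) are invented for illustration, it does not depend on Lucene, and it should not be read as the API in the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.LongStream;

/** Hypothetical sketch only; none of these types are the proposed Lucene API. */
public class ColumnBatchSketch {

  /** Cursor over (docId, value) pairs in non-decreasing docId order. */
  public interface LongCursor {
    boolean next(); // advances; returns false once the column is exhausted
    int docId();
    long value();
  }

  /** One field across all documents in a batch: a name plus a cursor factory. */
  public interface LongColumn {
    String name();
    LongCursor cursor(); // assumption: every call returns a fresh cursor
  }

  /** Sparse array-backed column: docIds[i] carries values[i]; docs may be skipped or repeated. */
  public static LongColumn arrayColumn(String name, int[] docIds, long[] values) {
    return new LongColumn() {
      public String name() { return name; }
      public LongCursor cursor() {
        return new LongCursor() {
          private int pos = -1;
          public boolean next() { return ++pos < docIds.length; }
          public int docId() { return docIds[pos]; }
          public long value() { return values[pos]; }
        };
      }
    };
  }

  /** Row pass: visit docs in order and gather each doc's values from every column. */
  public static List<String> rowPass(List<LongColumn> columns, int numDocs) {
    LongCursor[] cursors = new LongCursor[columns.size()];
    boolean[] live = new boolean[columns.size()];
    for (int c = 0; c < cursors.length; c++) {
      cursors[c] = columns.get(c).cursor();
      live[c] = cursors[c].next();
    }
    List<String> rows = new ArrayList<>();
    for (int doc = 0; doc < numDocs; doc++) {
      StringBuilder row = new StringBuilder();
      for (int c = 0; c < cursors.length; c++) {
        // A sparse column may have zero, one, or several values for this doc.
        while (live[c] && cursors[c].docId() == doc) {
          if (row.length() > 0) row.append(',');
          row.append(columns.get(c).name()).append('=').append(cursors[c].value());
          live[c] = cursors[c].next();
        }
      }
      rows.add(row.toString());
    }
    return rows;
  }

  /** Column pass: drain a single column through a fresh cursor. */
  public static long[] columnPass(LongColumn column) {
    LongStream.Builder out = LongStream.builder();
    LongCursor cursor = column.cursor();
    while (cursor.next()) {
      out.add(cursor.value());
    }
    return out.build().toArray();
  }

  public static void main(String[] args) {
    LongColumn price = arrayColumn("price", new int[] {0, 1, 2}, new long[] {10, 20, 30});
    LongColumn tag = arrayColumn("tag", new int[] {0, 2, 2}, new long[] {7, 8, 9});
    System.out.println(rowPass(List.of(price, tag), 3));
    System.out.println(columnPass(tag).length);
  }
}
```

Because cursor() returns a new cursor each time, the same column can be read in the row pass and then again in the column pass without any reset step. A dense variant would additionally advertise that every doc in [0, numDocs) has exactly one value, which is what would let a consumer skip the per-doc bookkeeping; the sketch leaves that out.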
