Very cool! I remember wanting something like this when I was looking into
making Lucene a bit better at ingesting small structured documents like
those indexed by IndexGeoNames (
https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/extra/perf/IndexGeoNames.java).
Is your POC complete enough to get a sense of the speedup that we'd get on
this benchmark?

On Tue, Apr 28, 2026 at 3:53 AM Tim Brooks <[email protected]> wrote:

> Hi all,
>
> I'd like to propose adding a column-oriented document-ingestion API to
> IndexWriter and get early feedback on the shape before opening a PR. I've
> been prototyping this on a branch and would like to understand community
> appetite before pushing further.
>
> https://github.com/apache/lucene/pull/15990
>
> ## The concept
>
> Today IndexWriter consumes an Iterable<IndexableField> per document: the
> indexing chain walks each field, re-resolves FieldInfo / PerField state,
> revalidates the field type against the schema, and interleaves
> stored-fields, postings, doc-values and points per document.
>
> The proposal is to add a parallel intake path:
> IndexWriter.addBatch(ColumnBatch). A ColumnBatch exposes a set of Columns,
> where each Column represents one field across all documents in the batch.
> The indexing chain then processes the batch in two passes:
>
> 1. A row-oriented pass for stored fields and the inverted index (per-doc
> processing still matters there).
> 2. A column-oriented pass for doc values, vectors, and points (where
> per-field bulk writes are a natural fit).
>
> Column itself is just metadata (name, IndexableFieldType, density).
> Iteration happens through typed cursors obtained from the subclasses:
> LongColumn for numeric DV, 1-D points, and numeric stored; BinaryColumn for
> binary/sorted DV, text/binary stored, and binary-encoded points; and
> VectorColumn for KNN vectors. Each cursor call returns a fresh cursor, so a
> column can be traversed once in the row pass and again in the column pass.
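
A minimal sketch of the cursor contract described above, assuming hypothetical names (`LongCursor`, `next()`, `docId()`, `value()`, `of(...)`) that are for illustration only and are not taken from the PR. It shows a sparse column (doc 1 has no value, doc 2 has two) handing out a fresh cursor on every call, so the same column can be drained once for a row pass and again for a column pass:

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSketch {
    /** Hypothetical cursor over (docId, value) entries; not the API from the PR. */
    interface LongCursor {
        boolean next();   // advance to the next entry; false once exhausted
        int docId();      // doc id of the current entry
        long value();     // value of the current entry
    }

    /** Hypothetical column: metadata plus a factory for independent cursors. */
    interface LongColumn {
        String name();
        LongCursor cursor(); // every call returns a new cursor positioned before the first entry
    }

    /** Builds a column backed by parallel primitive arrays (an assumption of this sketch). */
    static LongColumn of(String name, int[] docIds, long[] values) {
        return new LongColumn() {
            public String name() { return name; }
            public LongCursor cursor() {
                return new LongCursor() {
                    private int i = -1;
                    public boolean next() { return ++i < docIds.length; }
                    public int docId() { return docIds[i]; }
                    public long value() { return values[i]; }
                };
            }
        };
    }

    /** Drains one full traversal of the column into "docId=value" strings. */
    static List<String> drain(LongColumn column) {
        List<String> out = new ArrayList<>();
        LongCursor c = column.cursor();
        while (c.next()) {
            out.add(c.docId() + "=" + c.value());
        }
        return out;
    }

    public static void main(String[] args) {
        LongColumn col = of("population", new int[] {0, 2, 2}, new long[] {10, 20, 30});
        System.out.println(drain(col)); // first traversal (e.g. the row pass)
        System.out.println(drain(col)); // second traversal sees the same entries
    }
}
```

Because the cursor state lives in the cursor rather than in the column, the two traversals do not interfere with each other.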
>
> ## Two benefits motivate this:
>
> 1. More compact in-memory representation during indexing. A column batch
> avoids the per-field allocations of the document-at-a-time path
> (IndexableField instances, per-doc FieldType references, per-doc attribute
> maps). For numeric DV and points in particular, the caller can hand us a
> primitive-backed cursor that the chain drains directly into
> PackedLongValues / the points writer without indirection.
> 2. Less redundant field validation. Field name, type, indexing options,
> and schema compatibility are resolved once per column instead of once per
> IndexableField. For workloads where a caller already knows the schema of a
> batch, that revalidation is pure overhead.
>
> All in all, these changes reduce the CPU time spent in
> IndexWriter#addDocuments by 4-5x for analytics-heavy workloads.
>
> No changes to on-disk format; this is an ingestion-side API only.
>
> ## MVP: sparse columns
>
> The minimum useful version is sparse-only: every column is allowed to skip
> doc-ids or have multiple values per doc-id, and the chain goes through the
> same per-doc paths it uses today (just driven by a cursor instead of an
> IndexableField stream). This is enough to land the API, the two-pass
> consumer, and the public addBatch entry point without touching the
> doc-values / points writers.
>
> ## Follow-on option: dense columns
>
> The bigger performance wins come from advertising a column as dense —
> every doc in [0, numDocs) has exactly one value. That lets the chain:
>
> - Skip the sparse-bitset bookkeeping in NumericDocValuesWriter /
> SortedNumericDocValuesWriter entirely on the dense path.
> - Bulk-fill straight into PackedLongValues from the column's values()
> cursor, avoiding the per-value add loop.
> - For 1-D numeric points, feed the BKD writer from the same dense
> primitive cursor instead of one BytesRef at a time.
> - For n-D numeric points, a fixed-size binary column could feed multiple
> document points in a single write. This is an expert scenario, as users have
> to serialize the points properly in sort order in the column.
>
> Density is asserted by the column up-front so the chain can pick the path
> without probing.
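
To illustrate why an up-front density assertion helps, here is a rough sketch under stated assumptions: `Ingested`, `ingestDense` and `ingestSparse` are hypothetical names, and plain `long[]` / `java.util.BitSet` stand in for Lucene's internal buffers. The dense path is a single bulk copy with no docs-with-field tracking, while the sparse path has to record a doc id for every value:

```java
import java.util.BitSet;

public class DensitySketch {
    /** Hypothetical buffered form of one numeric column. */
    static final class Ingested {
        final long[] values;
        final BitSet docsWithField; // null means every doc has exactly one value

        Ingested(long[] values, BitSet docsWithField) {
            this.values = values;
            this.docsWithField = docsWithField;
        }
    }

    /** Dense path: values[i] belongs to doc i, so one bulk copy is enough. */
    static Ingested ingestDense(long[] columnValues) {
        long[] buffer = new long[columnValues.length];
        System.arraycopy(columnValues, 0, buffer, 0, columnValues.length);
        return new Ingested(buffer, null);
    }

    /** Sparse path: each value carries a doc id that must be tracked individually. */
    static Ingested ingestSparse(int numDocs, int[] docIds, long[] columnValues) {
        BitSet docsWithField = new BitSet(numDocs);
        long[] buffer = new long[columnValues.length];
        for (int i = 0; i < columnValues.length; i++) {
            docsWithField.set(docIds[i]); // per-value bookkeeping the dense path skips
            buffer[i] = columnValues[i];
        }
        return new Ingested(buffer, docsWithField);
    }

    public static void main(String[] args) {
        Ingested dense = ingestDense(new long[] {5, 6, 7});
        Ingested sparse = ingestSparse(4, new int[] {0, 3}, new long[] {5, 7});
        System.out.println(dense.docsWithField == null);
        System.out.println(sparse.docsWithField.cardinality());
    }
}
```

The choice between the two methods is made once per column from its declared density, which is the "pick the path without probing" idea from the proposal.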
>
> ## Follow-on option: Ergonomic builders
>
> I have focused on very low-level APIs (abstract long and byte columns
> implemented by users). Lucene could eventually add builders that make
> columns easier to create (similar to IntField, LongField, etc).
>
> ## Indexed-only terms ("DOCS + no norms") as a column
>
> One more case worth flagging: fields indexed with IndexOptions.DOCS and no
> norms — keyword/filter-style fields — don't need per-doc TokenStream
> plumbing. A BinaryColumn over such a field can feed the postings writer
> directly (one BytesRef per doc, no analysis, no norm accumulation). I have
> not implemented this in my POC.
>
> ## Scope of the initial proposal
>
> - New package org.apache.lucene.document.column with ColumnBatch, Column,
> LongColumn, BinaryColumn, and their cursors.
> - New IndexWriter.addBatch(ColumnBatch) returning a seqno, plumbed through
> DocumentsWriter / DocumentsWriterPerThread.
> - Indexing-chain changes to support the two-pass consumer.
> - All marked @lucene.experimental.
> - Try to implement as much of the column-oriented processing as possible in
> the column package to keep things experimental as long as possible.
>
> Would love feedback on whether this is something Lucene is interested in or
> would be open to. It would help significantly in the analytical case and
> remove significant indirection and the memory amplification caused by
> per-field allocations.
>
> Thanks,
> Tim
>
> --
>   Tim Brooks
>   [email protected]
>

-- 
Adrien
