Hello Rahul,
Currently, I’m using the following topology:
* I index my document records in the usual way.
* I index the chunk records with a reference to their parent document's id.
Concretely, it looks like this (simplified):
Document 1
-id: DOC_1
-title: 2025 Annual Report
-document_type: PDF
Chunk 1 of document 1
-id: CHUNK_1_1
-text: <text of the first chunk>
-vector: <embedding of the first chunk>
-parent_id: DOC_1
-position: 0
Chunk 2 of document 1
-id: CHUNK_1_2
-text: <text of the second chunk>
-vector: <embedding of the second chunk>
-parent_id: DOC_1
-position: 1
…
…
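For completeness, here's roughly how I push both record types (a minimal
sketch; the host, collection name, and chunk text are placeholders, and the
vectors are truncated to two dimensions for readability):

# Documents and chunks are ordinary sibling docs in the same collection,
# so either type can be (re)indexed at any time without touching the other.
curl 'http://localhost:8983/solr/my_collection/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[
    {"id": "DOC_1", "title": "2025 Annual Report", "document_type": "PDF"},
    {"id": "CHUNK_1_1", "text": "<text of the first chunk>",
     "vector": [0.255, 0.36], "parent_id": "DOC_1", "position": 0},
    {"id": "CHUNK_1_2", "text": "<text of the second chunk>",
     "vector": [0.255, 0.36], "parent_id": "DOC_1", "position": 1}
  ]'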
When I want to retrieve documents via a semantic search on the chunks, I
use a join, like this:
q={!join from=parent_id to=id score=max}{!knn f=vector
topK=100}[0.255,0.36,…]
The join's score aggregation guarantees that I won't get duplicate documents
in the result set. However, even though I request 100 chunks (topK), I'll
probably get fewer documents, because several chunks may belong to the same
document. I use the “max” aggregation to rank each document by its best chunk.
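In case it's useful, here's the same query as a complete request through the
JSON Request API (same placeholders as above; posting the params in a JSON
body avoids having to URL-encode the local-params syntax):

curl 'http://localhost:8983/solr/my_collection/select' \
  -H 'Content-Type: application/json' \
  -d '{
    "params": {
      "q": "{!join from=parent_id to=id score=max}{!knn f=vector topK=100}[0.255,0.36]",
      "fl": "id,title,score",
      "rows": 10
    }
  }'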
If I need to apply a filter on the **documents** (e.g., restrict the
semantic search to PDF documents), things get a bit more complicated
because the filtering must happen in the `preFilter` of the KNN search.
Here’s an example:
q={!join from=parent_id to=id score=max}{!knn f=vector topK=100
preFilter=$type_prefilter}[0.255,0.36,…]&type_prefilter={!join from=id
to=parent_id score=none} document_type:PDF
The pre-filtering is performed on the **documents**. The join then fetches
the chunks belonging to the documents that satisfy the constraint
(`document_type:PDF`), and those chunks become the corpus for the main
semantic search (via preFilter).
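Written the same way as a complete request, the filtered variant looks
roughly like this (same placeholders as above):

curl 'http://localhost:8983/solr/my_collection/select' \
  -H 'Content-Type: application/json' \
  -d '{
    "params": {
      "q": "{!join from=parent_id to=id score=max}{!knn f=vector topK=100 preFilter=$type_prefilter}[0.255,0.36]",
      "type_prefilter": "{!join from=id to=parent_id score=none}document_type:PDF",
      "fl": "id,title,document_type,score"
    }
  }'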
This indexing system works great for me because it lets me manage document
indexing and chunk indexing in a completely decoupled way. Solutions based
on “partial updates” or “nested documents” are problematic for me because I
can’t guarantee that all fields are `stored`, and I don’t want to have to
rebuild the documents when I index chunks.
I'm sure a better way to do this must exist, especially because *joins
always end up becoming a problem as the number of documents grows* (even
with docValues).
Hope this helps!
By the way, here’s an excellent video by Alessandro Benedetti that I
thought you might like:
https://youtu.be/9KJTbgtFWOU?si=YAUPNvfDhlX3NmJc&t=1450
Guillaume
On Sun, Aug 31, 2025 at 4:08 PM Sergio García Maroto <[email protected]>
wrote:
> Hi Rahul.
>
> Have you explored the possibility of using streaming expressions? You can
> get back tuples and group them?
>
> Regards
> Sergio
>
> On Sun 31 Aug 2025 at 14:09, Rahul Goswami <[email protected]> wrote:
>
> > Hello,
> > Floating this up again in case anyone has any insights. Thanks.
> >
> > Rahul
> >
> > On Fri, Aug 15, 2025 at 11:45 AM Rahul Goswami <[email protected]>
> > wrote:
> >
> > > Hello,
> > > A question for folks using Solr as the vector db in their solutions. As
> > > of now since Solr doesn't support parent/child or multi-valued vector
> > > field support for vector search, what are some strategies that can be
> > > used to avoid duplicates in top K results when you have vectorized
> > > chunks for the same (large) document?
> > >
> > > Would be also helpful to know how folks are doing this when storing
> > > vectors in the same docs as the lexical index vs when having the
> > > vectorized chunks in a separate index.
> > >
> > > Thanks.
> > > Rahul
> > >
> >
>