Re: Intra-segment search concurrency implementation

Luca Cavanna Thu, 01 Aug 2024 03:58:36 -0700

Hey Alan,
Thanks for the feedback.

I need to give it some more thought, but I kind of assumed that we would
not want to create different instances of leaf reader context for
partitions of the same segment. The mapping between the physical layout of
a segment and leaf reader context should remain 1:1. I need to think more
about implications of the approach you are suggesting.


This would certainly make it more difficult to work around the hits
counting issue I encountered. It sounds like we should be able to tell
somehow when the same physical segment backs multiple partitions that are
being searched, probably even more so once we address the duplicated
scorer supplier work.

I also see the partition abstraction as a rather internal concept, in that
it is only visible to slice generation, without leaking out to consumers,
where everything is still LeafReaderContext-based.
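
To make the idea concrete, here is a minimal sketch in plain Java, with no
Lucene dependency. All names in it (SegmentPartitionSketch, Partition,
split, totalCount) are hypothetical and are not the API in the PR. It
assumes a partition is a doc ID range within one segment, and that the
hits counting issue is a per-segment count being added once per partition
instead of once per segment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegmentPartitionSketch {

    /** Hypothetical stand-in: doc ID range [minDocId, maxDocId) of one segment. */
    public static final class Partition {
        public final int segmentOrd;
        public final int minDocId;
        public final int maxDocId;

        public Partition(int segmentOrd, int minDocId, int maxDocId) {
            this.segmentOrd = segmentOrd;
            this.minDocId = minDocId;
            this.maxDocId = maxDocId;
        }
    }

    /** Splits a segment with maxDoc documents into contiguous, near-equal ranges. */
    public static List<Partition> split(int segmentOrd, int maxDoc, int numPartitions) {
        List<Partition> partitions = new ArrayList<>();
        int base = maxDoc / numPartitions;
        int remainder = maxDoc % numPartitions;
        int start = 0;
        for (int i = 0; i < numPartitions; i++) {
            // The first `remainder` partitions get one extra document.
            int size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // more partitions requested than documents
            }
            partitions.add(new Partition(segmentOrd, start, start + size));
            start += size;
        }
        return partitions;
    }

    /**
     * Sums per-segment counts, adding each physical segment only once even
     * when several partitions of that segment are being searched.
     */
    public static int totalCount(List<Partition> partitions, int[] perSegmentCount) {
        Set<Integer> seenSegments = new HashSet<>();
        int total = 0;
        for (Partition p : partitions) {
            if (seenSegments.add(p.segmentOrd)) {
                total += perSegmentCount[p.segmentOrd];
            }
        }
        return total;
    }
}
```

The point of keying on segmentOrd is that something has to know which
partitions share a physical segment. That is straightforward while the
mapping between segment and leaf reader context stays 1:1, and harder if
each partition looks like an independent leaf.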

Cheers
Luca


On Wed, Jul 31, 2024, 11:36 Alan Woodward <romseyg...@gmail.com> wrote:

> Hi Luca,
>
> This is very exciting!  I haven’t followed the dev process very closely so
> far, so this may already have been looked at and dismissed as unworkable
> for various reasons, but I’m wondering if we definitely need a new
> abstraction for a LeafReaderContext partition?  Could we instead find a way
> to make IndexReader.leaves() return a view over the various segments that
> splits large segments into multiple LeafReaderContexts with different
> subsets of the docId space marked as deleted?
>
> I suppose we could lose some optimisations in count() implementations, but
> maybe it would be possible to check up-front if the count() for a segment
> returns -1 and only do the split in that case.
>
> - Alan
>
> On 29 Jul 2024, at 22:45, Luca Cavanna <java...@apache.org> wrote:
>
> Hey all,
> I have been working on an initial implementation of intra-segment search
> concurrency for Lucene.
>
> My goal is to introduce the ability to concurrently search partitions of
> the same segment, think of a force-merged segment for instance, in a way
> that's as transparent as possible to users. This way we can ideally
> decouple search concurrency from the index geometry, with the least impact
> on users. As part of my initial step, I decided to not tackle deduplicating
> work that happens globally per segment, which every partition would repeat
> on its own. This is certainly an important area to improve upon, yet I am
> hoping that we can treat it as a follow-up, mostly because there is enough
> work to do even without addressing that.
>
> After quite a few iterations, I have just marked my PR ready for review:
> https://github.com/apache/lucene/pull/13542. Tests are finally green. I
> wrote a rather detailed description on the PR itself that includes the
> problems I encountered, how I addressed them, and the way forward that I am
> proposing. There are still a couple of rough edges, and we still need to
> align on terminology API-wise. Mostly, what do we call a partition of a
> segment?
> Existing leaf slices are partitions of an index. We are now introducing
> partitions of segments that can be searched independently. I called them
> LeafReaderContextPartition, but I am not particularly attached to this
> specific name and open to feedback. This new terminology is only applied to
> the IndexSearcher#search method (not called directly by users though) and
> the IndexSearcher slice-related methods. Otherwise, users who just call
> search hopefully don't need to know what a segment partition is.
>
> I'd love to collect enough feedback to agree on a path forward and get
> this merged for Lucene 10, as it requires some API breaking changes as well
> as changes in internal behaviour.
>
>
> Looking forward to your feedback
>
> Cheers
> Luca
>
