Re: Concurrent HNSW index

2023-04-27 Thread Michael Wechner

+1 for a pull request

Thanks

Michael




Re: Concurrent HNSW index

2023-04-27 Thread Ishan Chattopadhyaya
+1, please contribute to Lucene. Thanks!



Concurrent HNSW index

2023-04-27 Thread Jonathan Ellis
Hi all,

I've created an HNSW index implementation that allows for concurrent build
and querying.  On my i9-12900 (8 performance cores and 8 efficiency cores) I get
a bit less than 10x speedup of wall clock time for building and querying
the "siftsmall" and "sift" datasets from http://corpus-texmex.irisa.fr/.
The small dataset is 10k vectors while the large is 1M.  This speedup feels
pretty good for a data structure that isn't completely parallelizable, and
it's good to see that it's consistent as the dataset gets larger.

As far as I have been able to test, the concurrent classes achieve recall
identical to the non-concurrent versions, and they are drop-in
replacements for OnHeapHnswGraph and HnswGraphBuilder; I use threadlocals
to work around the places where the existing API assumes no concurrency.
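
To illustrate the threadlocal workaround in general terms, here is a minimal
sketch. It is not code from the patch, and every name in it (ScratchPerThread,
Scratch, sumNeighbors, runConcurrently) is hypothetical. It assumes the
problem is a method whose signature leaves no room for per-call state, so the
mutable scratch space that a single-threaded class would keep in a field is
moved into a ThreadLocal instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: none of these names come from Lucene or the patch.
final class ScratchPerThread {

  // Mutable working state that a single-threaded API would keep in one shared field.
  static final class Scratch {
    final int[] candidates = new int[16];
    int size;

    void clear() {
      size = 0;
    }

    void add(int node) {
      candidates[size++] = node;
    }
  }

  // Each thread lazily gets its own Scratch, so concurrent callers never share one.
  private static final ThreadLocal<Scratch> SCRATCH = ThreadLocal.withInitial(Scratch::new);

  // Stand-in for an API method that cannot take the scratch space as a parameter.
  static int sumNeighbors(int node) {
    Scratch s = SCRATCH.get();
    s.clear();
    for (int i = 1; i <= 3; i++) {
      s.add(node + i); // pretend these are the neighbors of `node`
    }
    int sum = 0;
    for (int i = 0; i < s.size; i++) {
      sum += s.candidates[i];
    }
    return sum;
  }

  // Calls sumNeighbors for nodes 0..tasks-1 on a fixed pool and returns results in order.
  static List<Integer> runConcurrently(int threads, int tasks) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (int n = 0; n < tasks; n++) {
        final int node = n;
        Callable<Integer> task = () -> sumNeighbors(node);
        futures.add(pool.submit(task));
      }
      List<Integer> results = new ArrayList<>();
      for (Future<Integer> f : futures) {
        results.add(f.get());
      }
      return results;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

The trade-off in a sketch like this is that the method signature stays
unchanged, at the cost of one scratch object per thread that lives as long as
the thread does.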

The concurrent classes also pass the existing test suite with the exception
of the RAM usage tests; the estimator doesn't know about AtomicReference
etc.  (Big thanks to Michael Sokolov for testAknnDiverse which made it much
easier to track down subtle problems!)

My motivation is twofold:

1. It is faster to query a single on-heap HNSW index than to query
multiple such indexes and combine the results.
2. Even with some contention necessarily occurring while building the
index, we still come out way ahead in total efficiency compared with
creating per-thread indexes and combining them. Combining such indexes
boils down to "pick the largest and then add all the other nodes normally,"
so you don't really benefit from having computed the others previously.
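
For context on point 1, the multi-index alternative has to search every
index and then merge the per-index hits into one top-k list. The sketch below
shows only that merge step, using a bounded min-heap. It is an illustration
under assumptions, not Lucene code: TopKMerge, Hit and merge are hypothetical
names, and scores are assumed to be "higher is better".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: not part of Lucene or the proposed patch.
final class TopKMerge {

  // One search hit: a node id plus a similarity score (higher is better, by assumption).
  static final class Hit {
    final int id;
    final float score;

    Hit(int id, float score) {
      this.id = id;
      this.score = score;
    }
  }

  // Merges the hits returned by several independent indexes into a single top-k list,
  // ordered best first.
  static List<Hit> merge(List<List<Hit>> perIndexHits, int k) {
    if (k <= 0) {
      return Collections.emptyList();
    }
    // Min-heap on score: the weakest of the current best k sits at the head.
    PriorityQueue<Hit> heap =
        new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
    for (List<Hit> hits : perIndexHits) {
      for (Hit h : hits) {
        if (heap.size() < k) {
          heap.add(h);
        } else if (h.score > heap.peek().score) {
          heap.poll(); // evict the weakest hit to make room for a better one
          heap.add(h);
        }
      }
    }
    List<Hit> out = new ArrayList<>(heap);
    out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
    return out;
  }
}
```

The merge itself is cheap. The larger cost in the multi-index approach is
that every query has to run a full graph search against each index before
this step can start.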

I am currently adding this to Cassandra as code in our repo, but my
preference would be to upstream it.  Is Lucene open to a pull request?

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: [JENKINS] Lucene » Lucene-Check-main - Build # 8942 - Still Failing!

2023-04-27 Thread Dawid Weiss
I filed an INFRA issue to find out what's causing those failures.
https://issues.apache.org/jira/browse/INFRA-24524



On Thu, Apr 27, 2023 at 3:11 PM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Build: https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main/8942/
>
> All tests passed
>
> Build Log:
> [...truncated 75 lines...]
>
> -
> To unsubscribe, e-mail: builds-unsubscr...@lucene.apache.org
> For additional commands, e-mail: builds-h...@lucene.apache.org


Re: New branch and feature freeze for Lucene 9.6.0

2023-04-27 Thread Alan Woodward
I have started a release note here: 
https://cwiki.apache.org/confluence/display/LUCENE/Release+Notes+9.6

> On 27 Apr 2023, at 09:45, Alan Woodward wrote:
> 
> I have successfully wrestled Jenkins into submission, and there are now 9.6 
> jobs for Artifacts, Check and NightlyTests.
> 
>> On 26 Apr 2023, at 16:53, Alan Woodward wrote:
>> 
>> NOTICE:
>> 
>> Branch branch_9_6 has been cut and versions updated to 9.7 on stable branch.
>> 
>> Please observe the normal rules:
>> 
>> * No new features may be committed to the branch.
>> * Documentation patches, build patches and serious bug fixes may be
>> committed to the branch. However, you should submit all patches you
>> want to commit as pull requests first to give others the chance to review
>> and possibly vote against them. Keep in mind that it is our
>> main intention to keep the branch as stable as possible.
>> * All patches that are intended for the branch should first be committed
>> to the unstable branch, merged into the stable branch, and then into
>> the current release branch.
>> * Normal unstable and stable branch development may continue as usual.
>> However, if you plan to commit a big change to the unstable branch
>> while the branch feature freeze is in effect, think twice: can't the
>> addition wait a couple more days? Merges of bug fixes into the branch
>> may become more difficult.
>> * Only Github issues with Milestone 9.6
>> and priority "Blocker" will delay a release candidate build.
>> 
>> 
>> I am struggling to find the lucene Jenkins jobs on the new apache build 
>> server at https://jenkins-ccos.apache.org/ - if anybody has any hints as to 
>> how to navigate the helpful new interface with a non-functional search box, 
>> I would be very grateful…
>> 
>> It’s a holiday weekend coming up in the UK, so my plan is to give Jenkins a 
>> few days to chew things over (once I actually get the jobs running) and then 
>> build a RC on Tuesday 2nd May.
> 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: New branch and feature freeze for Lucene 9.6.0

2023-04-27 Thread Alan Woodward
I have successfully wrestled Jenkins into submission, and there are now 9.6 
jobs for Artifacts, Check and NightlyTests.


