Thanks, Alex! That's super helpful. Did you by any chance save any of your benchmarking results? I'd be curious to see them. I'm not super current here, but it looks like as of JDK 11, VarHandles might be the supported way to do this sort of thing? I'm also curious whether you have any experience with the FloatBuffer/ByteBuffer/MappedByteBuffer mechanism for memory-mapping a file.
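
To make the question concrete, here is a minimal sketch of what the two JDK-supported alternatives to `sun.misc.Unsafe` might look like for `float[]` <-> `byte[]` conversion. The class and method names are hypothetical (not taken from elastiknn or Lucene), the little-endian byte order is an arbitrary assumption, and nothing here has been benchmarked, so it says nothing about how either approach compares to Unsafe.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only.
final class FloatArraySerialization {

    // Views a byte[] as little-endian floats; the int coordinate is a byte offset.
    // byteArrayViewVarHandle has been available since JDK 9.
    private static final VarHandle FLOATS =
        MethodHandles.byteArrayViewVarHandle(float[].class, ByteOrder.LITTLE_ENDIAN);

    // VarHandle variant: write each float at its byte offset.
    static byte[] writeFloatsVarHandle(float[] values) {
        byte[] bytes = new byte[values.length * Float.BYTES];
        for (int i = 0; i < values.length; i++) {
            FLOATS.set(bytes, i * Float.BYTES, values[i]);
        }
        return bytes;
    }

    // VarHandle variant: read each float back from its byte offset.
    static float[] readFloatsVarHandle(byte[] bytes) {
        float[] values = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = (float) FLOATS.get(bytes, i * Float.BYTES);
        }
        return values;
    }

    // ByteBuffer/FloatBuffer variant: bulk put through a float view of the buffer.
    static byte[] writeFloatsBuffer(float[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        buffer.asFloatBuffer().put(values);
        return buffer.array();
    }

    // ByteBuffer/FloatBuffer variant: bulk get through a float view of the wrapped array.
    static float[] readFloatsBuffer(byte[] bytes) {
        float[] values = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(values);
        return values;
    }
}
```

The same `asFloatBuffer()` view also works on the `MappedByteBuffer` returned by `FileChannel.map(...)`, which is the memory-mapped case asked about above.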

-Mike

On Thu, Jul 16, 2020 at 9:57 PM Alex Klibisz (Jira) <j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159599#comment-17159599 ]
>
> Alex Klibisz commented on LUCENE-9322:
> --------------------------------------
>
> Hi all. Some great discussion here and in #9004 and #9136.
>
> I've been working on an Elasticsearch plugin for ANN for about 8 months now:
> http://elastiknn.klibisz.com/
> Obviously using Lucene under-the-hood but I'm definitely more fluent in
> Elasticsearch concepts than Lucene internals.
>
> Figured I would mention: One of the early bottlenecks was vector
> serialization (using BinaryDocValues to store the vectors). I did extensive
> benchmarking to figure out the fastest way to de-/serialize `float[]` and
> `int[]` arrays to/from byte arrays. In the end I ended up finding the
> `sun.misc.Unsafe` module beat all others. Here's the Java utility class that
> I'm using for de-/serialization in my plugin:
> [https://github.com/alexklibisz/elastiknn/blob/adf8262907093315d772ae524e822a1152b0e929/core/src/main/java/com/klibisz/elastiknn/storage/UnsafeSerialization.java]
>
> Maybe it can be helpful.
>
> > Discussing a unified vectors format API
> > ---------------------------------------
> >
> >                 Key: LUCENE-9322
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-9322
> >             Project: Lucene - Core
> >          Issue Type: New Feature
> >            Reporter: Julie Tibshirani
> >            Priority: Major
> >
> > Two different approximate nearest neighbor approaches are currently being
> > developed, one based on HNSW (LUCENE-9004) and another based on coarse
> > quantization ([#LUCENE-9136]). Each prototype proposes to add a new format
> > to handle vectors. In LUCENE-9136 we discussed the possibility of a unified
> > API that could support both approaches. The two ANN strategies give
> > different trade-offs in terms of speed, memory, and complexity, and it’s
> > likely that we’ll want to support both. Vector search is also an active
> > research area, and it would be great to be able to prototype and
> > incorporate new approaches without introducing more formats.
> >
> > To me it seems like a good time to begin discussing a unified API. The
> > prototype for coarse quantization
> > ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to
> > commit soon (this depends on everyone's feedback of course). The approach
> > is simple and shows solid search performance, as seen
> > [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
> > I think this API discussion is an important step in moving that
> > implementation forward.
> >
> > The goals of the API would be
> > # Support for storing and retrieving individual float vectors.
> > # Support for approximate nearest neighbor search -- given a query vector,
> > return the indexed vectors that are closest to it.