[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2022-08-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582097#comment-17582097
 ] 

ASF subversion and git services commented on LUCENE-9583:
-

Commit 8308688d786cd6c55fcbe4e59f67966f385989a2 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8308688d786 ]

LUCENE-9583: Remove RandomAccessVectorValuesProducer (#1071)

This change folds the `RandomAccessVectorValuesProducer` interface into
`RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
the cloning/copying behavior.

This is a small simplification related to LUCENE-9583, but does not address the
main issue.
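
For readers skimming the archive, here is a minimal sketch of roughly what the folded interface might look like after this change. The method names follow the commit description and the surrounding discussion; the exact signatures are illustrative, not copied from the Lucene sources.

{code:java}
// Illustrative sketch only: approximates the post-#1071 shape, not the exact Lucene API.
// Random access to vectors by dense ordinal, with copy() taking over the role of the
// removed RandomAccessVectorValuesProducer.randomAccess() factory.
public interface RandomAccessVectorValues {
  int size();                                                  // number of vectors
  int dimension();                                             // dimensions per vector
  float[] vectorValue(int ord) throws java.io.IOException;     // vector for a dense ordinal
  RandomAccessVectorValues copy() throws java.io.IOException;  // independent reader state
}
{code}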

> How should we expose VectorValues.RandomAccess?
> ---
>
> Key: LUCENE-9583
> URL: https://issues.apache.org/jira/browse/LUCENE-9583
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Michael Sokolov
>Assignee: Julie Tibshirani
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}} 
> sub-interface. [~jtibshirani] pointed out that this is not needed by some 
> vector-indexing strategies, which can operate solely using a forward iterator 
> (it is needed by HNSW), so in the interest of simplifying the public API 
> we should not expose this internal detail (which, by the way, surfaces internal 
> ordinals that are somewhat uninteresting outside the random access API).
> I looked into how to move this inside the HNSW-specific code and remembered 
> that we also currently make use of the RA API when merging vector fields 
> over sorted indexes. Without it, we would need to load all vectors into RAM 
> while flushing/merging, as we currently do in 
> {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost 
> for the simpler API.
> Another thing I noticed while reviewing this is that I moved the KNN 
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} 
> to {{VectorValues.RandomAccess}}. I think we could move this back, and 
> handle the HNSW requirements for search elsewhere. I wonder if that would 
> alleviate the major concern here?
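
For orientation, here is a rough sketch of the API shape described above. The names come from the description; the exact signatures, and the return type of search, are illustrative.

{code:java}
// Illustrative sketch of the API shape under discussion, not the actual Lucene source.
public abstract class VectorValues /* extends DocIdSetIterator in the real API */ {
  public abstract float[] vectorValue() throws java.io.IOException;       // value at the current doc
  public abstract RandomAccess randomAccess() throws java.io.IOException; // expert: random-access view

  public interface RandomAccess {
    float[] vectorValue(int ord) throws java.io.IOException;  // lookup by dense ordinal, not docid
    // KNN entry point mentioned above; the int[] of doc ids is a stand-in return type.
    int[] search(float[] target, int topK, int fanout) throws java.io.IOException;
  }
}
{code}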






[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2021-11-10 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441993#comment-17441993
 ] 

Julie Tibshirani commented on LUCENE-9583:
--

Sorry for the long radio silence on this one! I actually tried removing the 
public RandomAccess interface and got stuck. I was able to rework most 
interfaces so that only HNSW logic needed random access, and the public 
VectorValues interface could drop support. The part that presented a problem 
was merging, which uses a merged view of all the segments' VectorValues. We use 
this merged VectorValues to build the merged HNSW graph, so it needs to support 
random access. But these segments could have any type of vectors format, not 
just HNSW, so I couldn't guarantee they supported random access.

[~jpountz] had an idea to first write out the merged VectorValues to a file, 
then build the merged HNSW graph based on the combined VectorValues. This seems 
worth exploring to me. Perhaps it could also save effort while building the 
merged graph, since we wouldn't need to translate from the merged view to the 
individual vector values.
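
A self-contained sketch of that idea using only the JDK (the class and method names here are hypothetical, not Lucene's): vectors arrive from a forward-only iterator, are written sequentially to a temporary file, and can then be read back by ordinal, so the merged graph could be built without holding every vector on the heap.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Iterator;

final class SpilledVectors {
  private final FileChannel channel;
  private final int dim;

  private SpilledVectors(FileChannel channel, int dim) {
    this.channel = channel;
    this.dim = dim;
  }

  // Forward pass: consume the merged iterator once, writing each vector sequentially.
  static SpilledVectors writeAll(Iterator<float[]> vectors, Path tmp, int dim) throws IOException {
    FileChannel ch = FileChannel.open(tmp, StandardOpenOption.CREATE,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    while (vectors.hasNext()) {
      buf.clear();
      for (float v : vectors.next()) {
        buf.putFloat(v);
      }
      buf.flip();
      while (buf.hasRemaining()) {
        ch.write(buf);
      }
    }
    return new SpilledVectors(ch, dim);
  }

  // Random access by ordinal: the graph builder can look vectors up in any order.
  float[] vectorValue(int ord) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES);
    channel.read(buf, (long) ord * dim * Float.BYTES);
    buf.flip();
    float[] out = new float[dim];
    for (int i = 0; i < dim; i++) {
      out[i] = buf.getFloat();
    }
    return out;
  }
}
{code}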







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-11-13 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231499#comment-17231499
 ] 

Michael Sokolov commented on LUCENE-9583:
-

bq. Perhaps we could revisit this issue once the first ANN implementation is 
completed?

[~jtibshirani] that makes sense.  We can leave this open, even though the 
attached PR was pushed. I just pushed LUCENE-9004 as well, implementing NSW 
graph indexing, so that should give us a more concrete basis for comparison. I 
have been testing performance (recall/latency) using a KnnGraphTester class 
that is part of that. However, one challenge is coming up with a test dataset we 
can share. I have been using some proprietary embeddings, getting good results, 
and just started looking into testing with GloVe, and got not-so-good results 
there.  I am concerned that GloVe may have some strong clustering and require 
us to implement the diversity heuristic from the HNSW paper.







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-11-10 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229418#comment-17229418
 ] 

Julie Tibshirani commented on LUCENE-9583:
--

It's great that the 'search' method is now on the main VectorValues interface.

> By "wrong message" I mean that we require two implementations where only one 
> is needed. It will be difficult to optimize one type of access without 
> hurting the other so I'd lean toward a single pattern.
> So I would propose revisiting this once LUCENE-9004 lands.

To me there's still an open question around whether the public interface should 
support both access patterns (random and iterator-based), and if not which one 
should be chosen. Perhaps we could revisit this issue once the first ANN 
implementation is completed? I think it will be easier to understand the 
options at the code level, and easier for us to contribute and assess refactoring 
ideas.







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-11-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228677#comment-17228677
 ] 

ASF subversion and git services commented on LUCENE-9583:
-

Commit 8be0cea5442c2edab260d0598b920ba832506f80 in lucene-solr's branch 
refs/heads/master from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8be0cea ]

LUCENE-9583: extract separate RandomAccessVectorValues interface (#2037)









[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-11-06 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227534#comment-17227534
 ] 

Michael Sokolov commented on LUCENE-9583:
-

I worked up a version of LUCENE-9004 that uses docids in all public APIs, while 
still exposing random access via docid. This led to a measurable slowdown since 
it requires mapping back-and-forth between docids and ordinals.
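
A minimal sketch (not Lucene code) of where that extra cost comes from: when vectors are stored densely by ordinal but the API is keyed by docid, every lookup first has to translate the docid back into an ordinal, for example by binary-searching an ord-to-doc array.

{code:java}
import java.util.Arrays;

final class DocIdVectorLookup {
  private final int[] ordToDoc;    // sorted docids, one entry per stored vector
  private final float[][] vectors; // vectors stored densely by ordinal

  DocIdVectorLookup(int[] ordToDoc, float[][] vectors) {
    this.ordToDoc = ordToDoc;
    this.vectors = vectors;
  }

  float[] byOrdinal(int ord) {
    return vectors[ord]; // single array index
  }

  float[] byDocId(int docId) {
    int ord = Arrays.binarySearch(ordToDoc, docId); // extra O(log n) mapping on every call
    if (ord < 0) {
      throw new IllegalArgumentException("no vector stored for doc " + docId);
    }
    return vectors[ord];
  }
}
{code}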

These are results on an internal dataset. I'm showing the median latency of 
three runs in each case. The net is about a 10% increase in query latency and 
about 28% increase in indexing time.

h3. using ordinal API

||recall||latency||nDoc||fanout||maxConn||beamWidth||visited||index ms||
|0.914|1.73|100|0|64|500|2126|4628273|
|0.921|1.86|100|10|64|500|2260|0|
|0.924|2.02|100|20|64|500|2389|0|
|0.941|2.27|100|40|64|500|2644|0|

h3. using docId-only API

||recall||latency||nDoc||fanout||maxConn||beamWidth||visited||index ms||
|0.910|1.92|100|0|64|500|2084|5929137|
|0.920|2.05|100|10|64|500|2217|0|
|0.949|2.21|100|20|64|500|2399|0|
|0.959|2.51|100|40|64|500|2671|0|

Please note that there is precedent for exposing "internal" ordinals as part 
of our API in {{SortedDocValues}}, so we shouldn't shy away from that if it 
brings value.

I haven't had time to try out forward-only iteration, but I do expect it would 
introduce some marginal performance regression and considerably complicate the 
implementation of HNSW, at least. Finally, I'll remind everyone that we have a 
perfectly good forward-only iteration API for fetching binary data 
(BinaryDocValues), and that the genesis of this format was indeed the need for 
random access over vectors.

I'd appreciate it if folks with concerns could review the attached PR, which I 
think does a credible job of moving the random-access API into a place where it 
doesn't intrude on the main VectorValues API. That patch has been out for a 
week or so and I plan to push it soon if there are no further comments there 
(thanks for approving, @mccandless!). I recognize this topic is somewhat 
controversial, but I believe we can make rapid progress by iterating on code 
and measuring results.







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-11-01 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224262#comment-17224262
 ] 

Michael Sokolov commented on LUCENE-9583:
-

I did fairly recently work up a version of LUCENE-9322/LUCENE-9004  based on a 
graph of docids, without random access, and I guess I didn't like the way it 
came out. Maybe we'll end up going back there, given the weight of opinion on 
this thread, but for the record, the things I didn't like:

1. An additional lookup is required for every access (mapping docid to internal 
storage offset). Additional memory is required, especially when merging sorted 
indexes.
2. I found that in addition to reset(), I needed to add clone() because these 
iterators/accessors have internal state (the vector value), and clone() allows 
you to have two at once over the same underlying vectors that don't interfere 
with each other. So e.g. when constructing the graph, one accessor traverses all 
the values, while the other is used to pull related values for comparison (see 
the sketch after this list). And of course calling reset/advance/access means 
three calls when one would do.
3. In order to efficiently limit the number of outgoing nodes in the graph, we 
would want to keep a score-based priority queue per node (while constructing 
the graph). This means nodes are stored unordered, and in order to use a 
forward-only iterator to traverse a node's neighbors, they must be sorted 
repeatedly during graph traversal (or reset() called between every lookup).
4. We don't have any concrete ideas how we would make use of a restriction to 
forward-only iteration. Maybe in the future there would be an optimization that 
requires it, but so far I haven't seen any proposal.
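
A hedged sketch of point 2 in the list above: graph construction wants two independent cursors over the same vectors, one sweeping every value in order and one jumping around to fetch candidate neighbors for comparison, which is what clone() enables. The interface and helper names here are illustrative, not Lucene's.

{code:java}
// Illustrative only: a cursor with internal state (the current vector) plus a clone()
// that yields a second, independent cursor over the same underlying data.
interface VectorCursor {
  int size();
  float[] vectorValue(int ord);
  VectorCursor clone();
}

final class GraphBuildSketch {
  static void addAllNodes(VectorCursor values) {
    VectorCursor neighbors = values.clone();              // second, independent cursor
    for (int node = 0; node < values.size(); node++) {
      float[] candidate = values.vectorValue(node);       // outer sweep over all values
      for (int other : candidateNeighborsOf(node)) {
        float[] neighbor = neighbors.vectorValue(other);  // inner lookups don't disturb the sweep
        // ... compare candidate with neighbor, update graph edges ...
      }
    }
  }

  private static int[] candidateNeighborsOf(int node) {
    return new int[0]; // placeholder: a real build would search the partial graph for candidates
  }
}
{code}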

I do agree that exposing an API based on internal ordinals (not docids) is 
somewhat unsatisfying. Maybe we can bury it more deeply in the codec, or maybe 
there's a middle ground here, an API with direct lookup using docids?

I did not do careful performance studies of these different approaches, and I 
take the point that differences might be small. Speed is a feature here, so I 
don't think we should ignore it, and some data should help us make a decision. 
I can try to work up a study (KnnGraphTester is what I am using to measure 
perf) to see if there's a difference. Maybe I can resurrect the docid-based 
patch I had working at some point, and even try restricting to forward-only 
iteration, which I have to say to me seems radical!








[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-11-01 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224226#comment-17224226
 ] 

Tomoko Uchida commented on LUCENE-9583:
---

bq. I wonder if the graph approach really needs random access. For each node we 
need to access the list of neighbors so this pattern doesn't require random 
access because the list is already sorted by doc ids. So instead of adding 
another interface I wonder what do you think of adding a reset method in 
VectorValues ? For each node, the pattern to access would be to reset the 
iterator first and then move it to the first neighbor. We can make optimization 
internally to provide fast reset so that we don't need two implementations for 
the first two approaches that we foresee ?

I would prefer this approach, encouraging forward-only iteration and calling 
reset() when needed, even if it looks a bit non-intuitive for implementing 
graph-based aKNN search.
I feel that exposing a public API for the "random" access pattern needs a more 
careful decision, and that we should start conservatively (we have already 
discussed this several times and, as I understand it, haven't reached an 
agreement so far).







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-30 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223706#comment-17223706
 ] 

Jim Ferenczi commented on LUCENE-9583:
--

By "wrong message" I mean that we require two implementations where only one is 
needed. It will be difficult to optimize one type of access without hurting the 
other so I'd lean toward a single pattern. If it's random access so be it but 
the pros/cons should be considered carefully. The forward iterator design is 
constraining but it also forces to think about how to access the data 
efficiently.

> Yes, think of parent/child index with vectors only on the parent

I see these ordinals as an optimization detail of how the graph stores things. 
I don't think they should be exposed at all, since the user should interact with 
doc ids directly. It's something that could come later if needed, but it sounds 
like complexity we could avoid when introducing a new format. We don't need to 
optimize for the sparse case, at least not yet ;).







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-30 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223684#comment-17223684
 ] 

Michael Sokolov commented on LUCENE-9583:
-

> Ok so we need dense ordinals because you expect to have lots of documents 
> without a value ?

Yes, think of parent/child index with vectors only on the parent

> If random access is the way to go then we don't need the forward iterator but 
> I agree with Julie that it maybe send the wrong message which is why I 
> proposed to add the reset method.

It's true we may not need a forward iterator. I'm not sure what you mean by the 
"wrong message" though.







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-29 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223181#comment-17223181
 ] 

Jim Ferenczi commented on LUCENE-9583:
--

OK, so we need dense ordinals because you expect to have lots of documents 
without a value? What's the intent? Reducing the size needed to encode the doc 
ids in the graph? That seems premature to me, which is why I was wondering why 
we require two interfaces here. I don't understand why we have to keep two 
implementations. If random access is the way to go then we don't need the 
forward iterator, but I agree with Julie that it may send the wrong message, 
which is why I proposed adding the reset method.







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-28 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222399#comment-17222399
 ] 

Michael Sokolov commented on LUCENE-9583:
-

Thanks Jim, I think that approach could be workable. But it does overlook that 
the random access API now in 9.0 is defined in terms of dense *ordinals*, not 
(possibly sparse) *docIds*. And I do think there will be some non-negligible 
savings from a simple call of {{vectorValue(int ordinal)}} as opposed to 
{{advance(int docID)}} plus {{vectorValue()}} (and occasionally {{reset()}}), 
especially given that advancing by docID will at some point have to make use of 
the docID/ordinal mapping. (Aside: if we impose this forward-only-iterator 
constraint, we could very well stick with {{BinaryDocValues}} and dispense with 
this format altogether, as my original patch had done, when I got the opposite 
comment: that we should add a new format that supports random access :).

Still, maybe the savings is not so great? Maybe encoding docID in the graph is 
workable and we don't pay too much lookup cost switching back and forth between 
docIDs and ordinals? Maybe we benefit from not using random access in some way 
that I don't fully grok? Having said that, I think it would be best to get the 
graph implementation committed; then we could assess the impact more concretely, 
rather than trying to determine it in a theoretical vein. So I would propose 
revisiting this once LUCENE-9004 lands.







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-28 Thread Jim Ferenczi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1745#comment-1745
 ] 

Jim Ferenczi commented on LUCENE-9583:
--

I wonder if the graph approach really needs random access. For each node we 
need to access the list of neighbors, and this pattern doesn't require random 
access because the list is already sorted by doc ids. So instead of adding 
another interface, what do you think of adding a reset method to VectorValues? 
For each node, the access pattern would be to reset the iterator first and then 
advance it to the first neighbor. We could optimize internally to provide a fast 
reset, so that we don't need two implementations for the first two approaches 
that we foresee.
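
A small sketch of the access pattern proposed here (the interface is illustrative, not Lucene's): since a node's neighbor list is kept sorted by docid, visiting all of its neighbors only needs reset() followed by forward advance() calls.

{code:java}
interface ResettableVectors {
  void reset();                  // rewind to before the first document
  int advance(int targetDocId);  // forward-only positioning
  float[] vectorValue();         // vector at the current position
}

final class NeighborScan {
  static void visitNeighbors(ResettableVectors values, int[] neighborDocsSorted) {
    values.reset();                        // start a fresh forward pass for this node
    for (int doc : neighborDocsSorted) {   // docids ascend, so forward-only access suffices
      values.advance(doc);
      float[] vector = values.vectorValue();
      // ... score vector against the current node / query ...
    }
  }
}
{code}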

 

> Without it, we would need to load all vectors into RAM while flushing/merging, 
> as we currently do in {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's 
> worth paying this cost for the simpler API

We only do that when flushing, so the amount of data is limited.

The only case where this could be an issue is if you're sorting your index 
after the fact. That's a new feature we added recently, and in this specific 
case we do have to load the entire binary field into memory. However, this is 
not specific to vectors; any binary field is affected, so it's a more generic 
problem that we don't need to solve now, or maybe ever ;).







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-27 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221445#comment-17221445
 ] 

Michael Sokolov commented on LUCENE-9583:
-

The attached PR follows up on the idea of splitting out RandomAccess into a 
pair of new top-level interfaces: RandomAccessVectorValues and 
RandomAccessVectorValuesProducer. I think it will help with making the public 
API clearer. Is there a blessed way to mark such interfaces as 
internal-use-only, even though they must be public in order to be visible 
across packages internally?
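
For readers who haven't opened the PR, a rough sketch of the proposed split (signatures are illustrative): the random-access view becomes its own top-level interface, and a separate producer interface marks VectorValues implementations that can hand one out.

{code:java}
// Illustrative sketch of the two proposed top-level interfaces, not the exact PR contents.
public interface RandomAccessVectorValues {
  int size();
  int dimension();
  float[] vectorValue(int ord) throws java.io.IOException; // lookup by dense ordinal
}

public interface RandomAccessVectorValuesProducer {
  RandomAccessVectorValues randomAccess() throws java.io.IOException; // expert: random-access view
}
{code}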







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-26 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220770#comment-17220770
 ] 

Michael Sokolov commented on LUCENE-9583:
-

We could maybe move {{VectorValues.RandomAccess}} and the 
{{VectorValues.randomAccess()}} method to a standalone interface: 
{{RandomAccessVector}} or so (maybe we'd need two interfaces, one for the RA 
interface itself and another for producers of it). This standalone interface 
could even live in codecs to make it seem more internal/expert, although it 
might be weird to put it there? I'm not totally clear on the split between 
index and codecs. At least if we did this it would no longer jump out at you 
as part of VectorValues, although it would have to be public (unless we also 
moved stuff from VectorValuesWriter to codecs, in which case we could make it 
package private). Then we could have the existing implementations in 
codecs/lucene90 implement this interface and use typecasts to get access to it.
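
The typecast-based access mentioned at the end would look roughly like this, reusing the RandomAccessVectorValues / RandomAccessVectorValuesProducer names from the attached PR; the helper itself is hypothetical.

{code:java}
// Hypothetical helper: only HNSW-specific code would call this, falling back to
// forward-only iteration when the producer interface isn't implemented.
final class RandomAccessSupport {
  static RandomAccessVectorValues randomAccessOrNull(Object vectorValues) throws java.io.IOException {
    if (vectorValues instanceof RandomAccessVectorValuesProducer) {
      return ((RandomAccessVectorValuesProducer) vectorValues).randomAccess();
    }
    return null; // caller falls back to the forward iterator
  }
}
{code}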







[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-10-25 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220400#comment-17220400
 ] 

Julie Tibshirani commented on LUCENE-9583:
--

> Without it, we would need to load all vectors into RAM while 
> flushing/merging, as we currently do in BinaryDocValuesWriter.BinaryDVs. I 
> wonder if it's worth paying this cost for the simpler API.

This made me notice that we ignore vectors in {{SortingCodecReader}}, which can 
be used to sort an index after it's already been created. I opened 
https://github.com/apache/lucene-solr/pull/2028 to address this.

I'm not an expert in this code, but to me the trade-off seems worth it for a 
well-scoped API. Having a tighter set of methods makes it clear to callers how 
vectors are intended to be used: for retrieving docs through kNN or as a 
contributor to document scores. And there could still be room for future 
optimizations to avoid reloading the vectors? For example, when flushing we 
always work with {{BufferedVectorValues}} -- maybe {{SortingVectorValues}} 
could take that in directly.

> Another thing I noticed while reviewing this is that I moved the KNN 
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} 
> to {{VectorValues.RandomAccess}}. This I think we could move back, and handle 
> the HNSW requirements for search elsewhere.

This seems like a nice change, no matter what happens with {{RandomAccess}}!



