[GitHub] [solr] magibney commented on a change in pull request #476: SOLR-15880: K Nearest Neighbors Search

GitBox Thu, 20 Jan 2022 08:48:19 -0800


magibney commented on a change in pull request #476:
URL: https://github.com/apache/solr/pull/476#discussion_r788954745




##########
File path: solr/solr-ref-guide/src/dense-vector-search.adoc
##########
@@ -0,0 +1,308 @@
+= Dense Vector Search
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+The Apache Solr *Dense Vector Search* module adds support for indexing and 
searching dense numerical vectors.
+
+https://en.wikipedia.org/wiki/Deep_learning[Deep learning] can be used to 
produce a vector representation of both the query and the documents in a corpus 
of information.
+
+These neural network-based techniques are usually referred to as neural 
search, an industry derivation from the academic field of 
https://www.microsoft.com/en-us/research/uploads/prod/2017/06/fntir2018-neuralir-mitra.pdf[Neural Information Retrieval].
+
+== Important Concepts
+
+=== Dense Vector Representation 
+A dense vector describes information as an array of elements, each of them 
explicitly defined.

Review comment:
       Ahh, I see! Sorry for the delayed response. I'll take a crack at this; 
feel free to take or leave this suggestion:
   
   >=== Dense Vector Representation 
   >A traditional "tokenized inverted index" can be considered to model text as 
a "sparse" vector, in which each term in the index corresponds to one vector 
dimension. In such a model, the number of dimensions is generally quite high 
(corresponding to term cardinality), and the vector for any given document 
contains mostly "zeros" (hence it is "sparse", as only a handful of terms that 
exist in the overall index will be present in any given document).
   >
   >"Dense vector" representation contrasts with term-based "sparse vector" 
representation in that it distills semantic meaning into a fixed (and limited) 
number of dimensions. The number of dimensions in this approach is generally 
much lower than in the "sparse" case, and the vector for any given document is 
"dense", as most of its dimensions are populated by non-zero values.
   >
   >Solr exposes the capability to index and search dense vectors; but in 
contrast to the "sparse" approach (for which Solr provides tokenizers to 
"generate sparse vectors" directly from text input) the task of _generating_ 
vectors must be handled in application logic external to Solr. There may be 
cases where it makes sense to directly search data that natively exists as a 
vector (e.g., scientific data); but in a "text search" context, it is likely 
that users will leverage deep learning models such as 
https://en.wikipedia.org/wiki/BERT_(language_model)[BERT] to encode textual 
information as dense vectors, supplying the resulting vectors to Solr 
explicitly at index and query time.
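
   To make the sparse/dense contrast concrete, here is a minimal sketch. It is 
only an illustration, not part of the suggested ref-guide text and not Solr 
code: the vocabulary, the document text, and the hand-written "dense" values 
are all made up, and a real dense vector would come from an external model 
rather than being typed in by hand.

```python
from collections import Counter


def sparse_vector(text, vocabulary):
    """Bag-of-words vector: one dimension per term in the (hypothetical) index vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(term, 0) for term in vocabulary]


def density(vector):
    """Fraction of dimensions that hold a non-zero value."""
    return sum(1 for value in vector if value != 0) / len(vector)


# Hypothetical term dictionary of a tiny index; a real one is far larger,
# which makes the sparse vector sparser still.
vocabulary = [
    "analyzer", "apache", "cluster", "dense", "facet", "filter", "index", "lucene",
    "neural", "query", "ranking", "schema", "search", "shard", "solr", "vector",
]

doc = "Solr vector search"
sparse = sparse_vector(doc, vocabulary)

# Hypothetical 4-dimensional embedding; the values are invented for illustration.
dense = [0.12, -0.48, 0.33, 0.91]

print(len(sparse), density(sparse))  # many dimensions, mostly zeros
print(len(dense), density(dense))    # few dimensions, all populated
```

   The sparse vector has one dimension per vocabulary term and is mostly 
zeros, while the dense vector has a small, fixed number of dimensions that are 
all populated.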




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



