magibney commented on a change in pull request #476: URL: https://github.com/apache/solr/pull/476#discussion_r789773406
##########
File path: solr/solr-ref-guide/src/dense-vector-search.adoc
##########
@@ -0,0 +1,308 @@
+= Dense Vector Search
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+The Apache Solr *Dense Vector Search* module adds support for indexing and searching dense numerical vectors.
+
+https://en.wikipedia.org/wiki/Deep_learning[Deep learning] can be used to produce a vector representation of both the query and the documents in a corpus of information.
+
+These neural network-based techniques are usually referred to as neural search, an industry derivation from the academic field of https://www.microsoft.com/en-us/research/uploads/prod/2017/06/fntir2018-neuralir-mitra.pdf[Neural information Retrieval].
+
+== Important Concepts
+
+=== Dense Vector Representation
+A dense vector describes information as an array of elements, each of them explicitly defined.

Review comment:
Thanks, Alessandro! I think maybe we should go either/or with reference to "Bag of words" vs. "Inverted index". In the text I drafted I referred to an "inverted index" as an example of sparse vector representation because I figured it would likely be the most familiar reference point for the "Solr refguide" audience.
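
To make the reference points concrete, here is a minimal Python sketch, purely illustrative and not how Solr or Lucene implement any of this. The names (`docs`, `bag_of_words`, `positional_inverted_index`, `phrase_match`, `to_sparse_vector`) are hypothetical, and the values in `dense_d1` are made up to stand in for a model-produced embedding.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical documents, whitespace tokenization only).
docs = {
    "d1": "solr searches text",
    "d2": "text about solr and text",
}

def bag_of_words(text):
    """Orderless representation: only per-term counts survive."""
    return Counter(text.split())

def positional_inverted_index(docs):
    """Word-level inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def phrase_match(index, first, second):
    """Doc ids where `second` immediately follows `first`; requires positions."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.add(doc_id)
    return hits

def to_sparse_vector(counts, vocabulary):
    """View term counts as one dimension per vocabulary term."""
    return [counts.get(term, 0) for term in vocabulary]

vocabulary = sorted({t for text in docs.values() for t in text.split()})
index = positional_inverted_index(docs)

# Sparse view of d1: terms absent from d1 are zeros; with a realistic
# vocabulary nearly every entry would be zero.
sparse_d1 = to_sparse_vector(bag_of_words(docs["d1"]), vocabulary)

# Dense view of d1: short, every element explicitly defined.
# These numbers are invented for illustration only.
dense_d1 = [0.12, -0.48, 0.33, 0.91]
```

Under these assumptions, `phrase_match(index, "solr", "searches")` can be answered from the positional index but not from the `bag_of_words` counts alone, which is the functional difference at stake when choosing which of the two to name as the sparse reference point.
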

I note that Wikipedia has two separate entries: [Bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model) and [Inverted index](https://en.wikipedia.org/wiki/Inverted_index), neither of which refers directly to the other, but both of which include a "See also" reference to the page for [Vector space model](https://en.wikipedia.org/wiki/Vector_space_model). The Wikipedia pages are explicit about one difference, with the second clearly describing how most people use Solr:

1. "The Bag-of-words model is an orderless document representation — only the counts of words matter."
2. "A word-level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document.[2] The latter form offers more functionality (like phrase searches)"

Either model is arguably a valid example of a "sparse vector" representation. My practical concern is that in this context, it would be easy to misinterpret the reference to "Bag of words" as an implicit reference to "_non-vector_ search in Solr", whatever the intention of using "Bag-of-words" as a point of reference. I think that interpretation would be misleading, and could obscure the true distinction between "dense vector search" and classic TF-IDF/BM25/phrase-boosted/etc. search in Solr, and the appropriate use cases for each approach.

(minor point: I'd also be inclined to drop the separate heading for "Sparse Vector Representation". I was purposefully vague in saying `can be considered to model text as a "sparse" vector`; perhaps I'm mistaken, but my impression is that "sparse" retrieval models (as _accurate_ as that characterization may be) are most often characterized as "sparse" as a foil to explain why "dense retrieval" is characterized as "dense". Indeed, that's what we're doing here!)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org