I poked around on Hugging Face looking at various models being promoted there. This is the highest-performing text model they list that is expected to take sentences as input; it uses so-called "attention" to capture the context of words, and it is 768-dimensional: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

This is a list of models designed for "asymmetric semantic search", i.e. short queries against long documents: https://www.sbert.net/docs/pretrained-models/msmarco-v3.html. The highest-ranking one there also seems to be 768d: https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco
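For what it's worth, here is a rough sketch of what the sliding-window concatenation discussed below might look like. The window size, the padding of short documents, and the final averaging of windows into a single document vector are all my assumptions for illustration, not anything taken from luceneutil:

```python
# Sketch: build higher-dimensional doc vectors from 300-dim token vectors
# by concatenating a sliding window of token vectors (4 x 300 -> 1200).
import numpy as np

def doc_vector(token_vecs, window=4):
    """Concatenate each run of `window` consecutive token vectors, then
    average the resulting window vectors into one document vector."""
    if len(token_vecs) < window:
        # pad short documents by repeating the last token vector
        token_vecs = token_vecs + [token_vecs[-1]] * (window - len(token_vecs))
    windows = [np.concatenate(token_vecs[i:i + window])
               for i in range(len(token_vecs) - window + 1)]
    return np.mean(windows, axis=0)

# toy stand-ins for 300-dim GloVe vectors of a 6-token document
rng = np.random.default_rng(0)
tokens = [rng.standard_normal(300) for _ in range(6)]
vec = doc_vector(tokens, window=4)
print(vec.shape)  # (1200,)
```

With window=5 the same scheme would yield 1500-dim vectors; whether averaging the windows (vs. keeping only one window per document) preserves enough sequence information is an open question.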
I did see some other larger-dimensional models, but they all seem to involve images+text.

On Mon, Apr 10, 2023 at 9:54 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> I think concatenating word-embedding vectors is a reasonable thing to do.
> It captures information about the sequence of tokens that is being lost by
> the current approach (summing them). A random article I found in a search,
> https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca,
> shows higher performance with a concatenative approach. So it seems to me
> we could take the 300-dim GloVe vectors and produce somewhat meaningful
> (say) 1200- or 1500-dim vectors by running a sliding window over the
> tokens in a document and concatenating the token vectors.
>
> On Sun, Apr 9, 2023 at 2:44 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> > > We do have a dataset built from Wikipedia in luceneutil. It comes in
> > > 100- and 300-dimensional varieties and can easily enough generate
> > > large numbers of vector documents from the articles data. To go
> > > higher we could concatenate vectors from that, and I believe the
> > > performance numbers would be plausible.
> >
> > Apologies - I wasn't clear - I thought of building the 1k or 2k
> > vectors so that they would be realistic. Perhaps using GloVe, or
> > perhaps using some other software, but something that would accurately
> > reflect a true 2k dimensional space with "real" data underneath. I am
> > not familiar enough with the field to tell whether a simple
> > concatenation is a good enough simulation - perhaps it is.
> >
> > I would really prefer to focus on doing this kind of assessment of
> > feasibility/limitations rather than arguing back and forth. I did my
> > experiment a while ago and I can't really tell whether there have been
> > improvements in the indexing/merging part - your email contradicts my
> > experience, Mike, so I'm a bit intrigued and would like to revisit it.
> > But it'd be ideal to work with real vectors rather than a simulation.
> >
> > Dawid
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> > ---------------------------------------------------------------------