I poked around on Hugging Face looking at various models being promoted
there. The highest-performing text model they list that takes sentences
as input, using so-called "attention" to capture the context of words,
is https://huggingface.co/sentence-transformers/all-mpnet-base-v2, and
it is 768-dimensional. Here is a list of models designed for "asymmetric
semantic search", i.e. short queries and long documents:
https://www.sbert.net/docs/pretrained-models/msmarco-v3.html. The
highest-ranking one there also seems to be 768d:
https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco

I did see some other larger-dimensional models, but they all seem to
involve images+text.
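
As a minimal sketch of the sliding-window concatenation idea discussed
below (the window size, zero-padding for short documents, and averaging
of per-window vectors are my assumptions, and the function name is
hypothetical):

```python
import numpy as np

def concat_window_vectors(token_vecs: np.ndarray, window: int = 4) -> np.ndarray:
    """Concatenate sliding windows of token vectors into one document vector.

    token_vecs: (num_tokens, dim) array of word embeddings (e.g. 300-dim GloVe).
    Returns a (dim * window,) vector, e.g. 1200-dim for dim=300, window=4.
    """
    num_tokens, dim = token_vecs.shape
    if num_tokens < window:
        # Assumption: pad short documents with zero vectors so they
        # still produce a full-size output.
        pad = np.zeros((window - num_tokens, dim))
        token_vecs = np.vstack([token_vecs, pad])
        num_tokens = window
    # One concatenated (dim * window) vector per window position,
    # preserving local token order within each window.
    windows = np.stack([
        token_vecs[i : i + window].reshape(-1)
        for i in range(num_tokens - window + 1)
    ])
    # Assumption: average the per-window vectors to get a single
    # fixed-size vector per document.
    return windows.mean(axis=0)
```

Whether averaging the windows loses too much of the sequence
information is exactly the kind of thing the proposed experiments
would need to check.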

On Mon, Apr 10, 2023 at 9:54 AM Michael Sokolov <msoko...@gmail.com> wrote:
>
> I think concatenating word-embedding vectors is a reasonable thing to
> do. It captures information about the sequence of tokens that is
> being lost by the current approach (summing them). A random article I
> found in a search,
> https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca,
> shows higher performance with a concatenative approach. So it seems to
> me we could take the 300-dim GloVe vectors and produce somewhat
> meaningful (say) 1200- or 1500-dim vectors by running a sliding window
> over the tokens in a document and concatenating the token vectors.
>
> On Sun, Apr 9, 2023 at 2:44 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> > > We do have a dataset built from Wikipedia in luceneutil. It comes in 100 
> > > and 300 dimensional varieties and can easily enough generate large 
> > > numbers of vector documents from the articles data. To go higher we could 
> > > concatenate vectors from that and I believe the performance numbers would 
> > > be plausible.
> >
> > Apologies - I wasn't clear - I was thinking of building 1k or 2k
> > vectors that would be realistic, perhaps using GloVe or perhaps
> > some other software, but something that would accurately reflect a
> > true 2k-dimensional space with "real" data underneath. I am not
> > familiar enough with the field to tell whether a simple concatenation
> > is a good enough simulation - perhaps it is.
> >
> > I would really prefer to focus on this kind of feasibility/limitations
> > assessment rather than arguing back and forth. I did my experiment a
> > while ago and can't really tell whether there have been improvements
> > in the indexing/merging part - your email contradicts my experience,
> > Mike, so I'm a bit intrigued and would like to revisit it. But it'd
> > be ideal to work with real vectors rather than a simulation.
> >
> > Dawid
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
