Hi Doug,

Many thanks for the tons of useful information!

Some comments/questions inline below.

-- Ken

> On Oct 19, 2018, at 10:46 AM, Doug Turnbull
> <dturnb...@opensourceconnections.com> wrote:
>
> This is a pretty big hole in Lucene-based search right now that many
> practitioners have struggled with
>
> I know a couple of people who have worked on solutions. And I've used a
> couple of hacks:
>
> - You can hack together something that does cosine similarity using the
> term frequency & query boosts DelimitedTermFreqFilterFactory. Basically the
> term frequency becomes a feature weight on the document. Boosts become the
> query weight. If you massage things correctly with the similarity, the
> resulting boolean similarity is a dot product…

I’ve done a quick test of that approach, though not as elegantly. I just constructed a string of “terms” (feature indices) that generated an approximation to the target vector. DelimitedTermFreqFilterFactory is much better :)

The problem I ran into was that some features have negative weights, and it wasn’t obvious whether it would work to have a second field (with only the negative weights) that I used for (not really supported in Solr?) negative boosting.

Is there some hack to work around that?

> - Erik Hatcher has done some great work with payloads which you might want
> to check out. See the delimited payload filter factory, and payload score
> function queries

Thanks, I’d poked at payloads a bit. From what I could tell, there isn't a way to use payloads with negative feature values, or to filter results, but maybe I didn’t dig deep enough.

> - Simon Hughes Activate Talk (slides/video not yet posted) covers this
> topic in some depth

OK, that looks great - https://activate2018.sched.com/event/FkM3 and https://github.com/DiceTechJobs/VectorsInSearch

Seems like the planets are aligning for this kind of thing.

> - Rene Kriegler's Haystack Talk discusses encoding Inception model
> vectorizations of images:
> https://opensourceconnections.com/events/haystack-single/haystack-relevance-scoring/

Good stuff, thanks!

I’d be curious what his querqy <https://github.com/renekrie/querqy> configuration looked like for the “summing up fieldweights only (ignore df; use cross-field tf)” row in his results table on slide 36.

The use of LSHs (what he describes in this talk as “random projection forest”) is something I’d suggested to the client, to mitigate the need for true feature vector support.

Using an initial LSH-based query to get candidates, and then re-ranking based on the actual feature vector, is something I was expecting Rene to discuss but he didn’t seem to mention it in his talk.

> If this is a huge importance to you, I might also suggest looking at vespa,
> which makes tensors a first-class citizen and makes matrix-math pretty
> seamless: http://vespa.ai

Interesting, though my client is pretty much locked into using Solr.

> On Fri, Oct 19, 2018 at 12:50 PM Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
>
>> Hi all,
>>
>> [I posted on the Lucene list two days ago, but didn’t see any response -
>> checking here for completeness]
>>
>> I’ve been looking at directly storing feature vectors and providing
>> scoring/filtering support.
>>
>> This is for vectors consisting of (typically 300 - 2048) floats or doubles.
>>
>> It’s following the same pattern as geospatial support - so a new field
>> type and query/parser, plus plumbing to hook it into Solr.
>>
>> Before I go much further, is there anything like this already done, or in
>> the works?
>>
>> Thanks,
>>
>> -- Ken
>>
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
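
For reference, here is a minimal sketch of the two-stage idea mentioned above: use an LSH signature to pick candidates, then re-rank those candidates by the actual feature vectors. It is plain Python rather than Solr code, and everything in it is an illustrative assumption: the random-hyperplane (sign) hash, the function names, the `min_matching_bits` threshold, and the toy vectors. It is not what Rene's talk or any Solr plugin does. In a real index the signature bits would presumably be stored as terms and stage 1 would be an ordinary term query; the linear scan here only stands in for that. One property worth noting is that this hash depends only on the sign of each projection, so negative feature values need no special handling.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes of dimension dim (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def cosine(a, b):
    """Exact cosine similarity, used only for the re-ranking stage."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, hyperplanes, min_matching_bits, top_k):
    """Stage 1: keep docs whose signature shares enough bits with the query's.
    Stage 2: re-rank the surviving candidates by exact cosine similarity."""
    query_sig = lsh_signature(query, hyperplanes)
    candidates = []
    for doc_id, vec in docs.items():
        doc_sig = lsh_signature(vec, hyperplanes)
        matching = sum(1 for q, d in zip(query_sig, doc_sig) if q == d)
        if matching >= min_matching_bits:
            candidates.append(doc_id)
    ranked = sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:top_k]


# Toy usage: 4-dimensional vectors, some with negative weights.
docs = {
    "a": [1.0, -2.0, 0.5, 3.0],
    "b": [0.9, -1.8, 0.7, 2.5],
    "neg": [-1.0, 2.0, -0.5, -3.0],
}
planes = make_hyperplanes(num_bits=16, dim=4, seed=42)
print(search([1.0, -2.0, 0.5, 3.0], docs, planes, min_matching_bits=12, top_k=2))
```

The trade-off is the usual one: a higher `min_matching_bits` shrinks the candidate set (cheaper re-ranking) at the cost of possibly missing true neighbours.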