Oh interesting! I did not know about this FeatureField. (The link was to the old repo, which is now gone; this one worked for me: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/FeatureField.java)
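
For anyone else who hadn't seen it, here is a minimal sketch of how I understand it is used, based on Mayya's description below. The field name "features", the feature names, and the weight/pivot values are placeholders I made up:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

class FeatureFieldSketch {

  // Index time: one "features" field can carry several named features per
  // document; the value is encoded into the term's frequency, so it must be
  // a positive, finite float.
  static Document buildDoc() {
    Document doc = new Document();
    doc.add(new StringField("id", "doc1", Field.Store.YES));
    doc.add(new FeatureField("features", "pagerank", 3.2f));
    doc.add(new FeatureField("features", "recency", 0.8f));
    return doc;
  }

  // Query time: fold one feature into the score as an additive SHOULD clause.
  // The weight and pivot here are arbitrary example values.
  static Query withFeatureScore(Query mainQuery) {
    return new BooleanQuery.Builder()
        .add(mainQuery, BooleanClause.Occur.MUST)
        .add(FeatureField.newSaturationQuery("features", "pagerank", 1.0f, 3.0f),
            BooleanClause.Occur.SHOULD)
        .build();
  }
}

If I understand it correctly, this would keep many feature dimensions in a single indexed field instead of one field each, though the values are per document rather than the per-term vectors discussed below.
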
On Wed, Nov 11, 2020 at 4:37 PM Mayya Sharipova <[email protected]> wrote:
>
> For sparse vectors, we found that Lucene's FeatureField could also be useful.
> It stores features as terms and feature values as term frequencies, and
> provides several convenient functions to calculate scores based on feature
> values.
>
> On Fri, Nov 6, 2020 at 11:16 AM Michael McCandless <[email protected]> wrote:
>>
>> Also, be aware that recent Lucene versions enabled compression for
>> BinaryDocValues fields, which might hurt the performance of your second
>> solution.
>>
>> This compression is not yet something you can easily turn off, but there are
>> ongoing discussions/PRs about how to make it more easily configurable for
>> applications that really care more about search CPU cost than index size for
>> BinaryDocValues fields: https://issues.apache.org/jira/browse/LUCENE-9378
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Fri, Nov 6, 2020 at 10:21 AM Michael McCandless <[email protected]> wrote:
>>>
>>> In addition to payloads having kind of high-ish overhead (they slow down
>>> indexing, do not compress very well I think, and slow down search because
>>> you must pull positions), they are also sort of a forced fit for your use
>>> case, right? Because a payload in Lucene is per-term-position, whereas you
>>> really need this vector per-term (irrespective of the positions where that
>>> term occurs in each document)?
>>>
>>> Your second solution is an intriguing one. So you would use Lucene's
>>> custom term frequencies to store indices into that per-document map
>>> encoded into a BinaryDocValues field? During indexing I guess you would
>>> need a TokenFilter that hands out these indices in order (0, 1, 2, ...)
>>> based on the unique terms it sees, and after all tokens are done, it
>>> exports a byte[] serialized map? Hmm, except term frequency 0 is not
>>> allowed, so you'd need to add 1 to all indices.
>>>
>>> Mike McCandless
>>> http://blog.mikemccandless.com
>>>
>>> On Mon, Oct 26, 2020 at 6:16 AM Bruno Roustant <[email protected]> wrote:
>>>>
>>>> Hi Ankur,
>>>> Indeed, payloads are the standard way to solve this problem. For light
>>>> queries with a few top-N results, that should be efficient. For
>>>> multi-term queries it could become penalizing if you need to access the
>>>> payloads of too many terms.
>>>> Also, there is an experimental PostingsFormat called
>>>> SharedTermsUniformSplit (class named STUniformSplitPostingsFormat) that
>>>> would allow you to effectively share the overlapping terms in the index
>>>> while keeping the 50 fields. This would solve the index-bloat issue, but
>>>> would not fully solve the seeks issue. You might want to benchmark this
>>>> approach too.
>>>>
>>>> Bruno
>>>>
>>>> On Fri, Oct 23, 2020 at 2:48 AM Ankur Goel <[email protected]> wrote:
>>>>>
>>>>> Hi Lucene Devs,
>>>>> I have a need to store a sparse feature vector on a per-term basis.
>>>>> The total number of possible dimensions is small (~50) and known at
>>>>> indexing time. The feature values will be used in scoring along with
>>>>> corpus statistics. It looks like payloads were created for this exact
>>>>> purpose, but some workaround is needed to minimize the performance
>>>>> penalty, as mentioned on the wiki.
>>>>>
>>>>> An alternative is to override the term frequency to be a pointer into a
>>>>> Map<pointer, Feature_Vector> serialized and stored in BinaryDocValues.
>>>>> At query time, the matching docId will be used to advance to the
>>>>> starting offset of this map, and the term frequency will be used to
>>>>> look up the Feature_Vector in the serialized map. That's my current
>>>>> plan, but I haven't benchmarked it yet.
>>>>>
>>>>> The problem I am trying to solve is to reduce index bloat and eliminate
>>>>> unnecessary seeks: currently these ~50 dimensions are stored as
>>>>> separate fields in the index with very high term overlap, and Lucene
>>>>> does not share the terms dictionary across different fields. Sharing it
>>>>> could itself be a new feature for Lucene, but would require a lot of
>>>>> work, I imagine.
>>>>>
>>>>> Any ideas are welcome :-)
>>>>>
>>>>> Thanks
>>>>> -Ankur
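
To make Mike's TokenFilter suggestion a bit more concrete, here is a rough sketch of the indexing side. The class name and the termIndexes() accessor are made up, and it assumes the field is indexed with frequencies but without positions, since (if I remember right) custom term frequencies are only allowed with IndexOptions.DOCS_AND_FREQS:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

/**
 * Emits each unique term once and sets its "term frequency" to a 1-based
 * index (1, 2, 3, ...). The shift by 1 is there because a term frequency of
 * 0 is not allowed. After a document's tokens are consumed, termIndexes()
 * gives the mapping needed to serialize the per-term feature vectors, in
 * index order, into a byte[] stored in a BinaryDocValuesField on the same
 * document (not shown here).
 */
final class TermIndexAssigningFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute tfAtt = addAttribute(TermFrequencyAttribute.class);
  // Insertion order matters: index i corresponds to the i-th serialized vector.
  private final Map<String, Integer> termToIndex = new LinkedHashMap<>();

  TermIndexAssigningFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String term = termAtt.toString();
      if (termToIndex.containsKey(term)) {
        continue; // emit each unique term only once; repeats would sum their frequencies
      }
      int index = termToIndex.size() + 1; // 1-based: term frequency 0 is illegal
      termToIndex.put(term, index);
      tfAtt.setTermFrequency(index);
      return true;
    }
    return false;
  }

  /** Read this after consuming the stream, before the next reset(). */
  Map<String, Integer> termIndexes() {
    return termToIndex;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    termToIndex.clear(); // start a fresh mapping for the next document
  }
}
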
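And here is, roughly, the query-time half of my plan. Everything below is made up for illustration: in a real implementation this would sit inside a custom Query/Scorer where both freq() and the doc values are available, the doc values would be fetched once per segment rather than per hit, and the vectors would need a sparse encoding instead of the dense float[50] blocks assumed here:

import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.util.BytesRef;

final class PerTermFeatureLookup {
  private static final int DIMS = 50; // assumed fixed number of dimensions

  /**
   * Given a postings iterator positioned on a matching doc, fetch the feature
   * vector for that (term, doc) pair. Assumes the doc's BinaryDocValues bytes
   * are a flat array of float[DIMS] blocks, one per unique term, in the order
   * of the 1-based indices assigned by the indexing-side filter.
   */
  static float[] featuresFor(LeafReader reader, String dvField, PostingsEnum postings)
      throws IOException {
    BinaryDocValues dv = DocValues.getBinary(reader, dvField);
    if (!dv.advanceExact(postings.docID())) {
      return null; // no vector stored for this document
    }
    BytesRef bytes = dv.binaryValue();
    int index = postings.freq() - 1; // undo the +1 shift applied at index time
    float[] features = new float[DIMS];
    int base = bytes.offset + index * DIMS * Float.BYTES;
    for (int i = 0; i < DIMS; i++) {
      int bits = readIntBE(bytes.bytes, base + i * Float.BYTES);
      features[i] = Float.intBitsToFloat(bits);
    }
    return features;
  }

  private static int readIntBE(byte[] b, int off) {
    return ((b[off] & 0xFF) << 24) | ((b[off + 1] & 0xFF) << 16)
        | ((b[off + 2] & 0xFF) << 8) | (b[off + 3] & 0xFF);
  }
}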
