Oh interesting! I did not know about this FeatureField. (The link in the
email pointed to the old repo, which is now gone; this one worked for me:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/FeatureField.java)
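For anyone else landing here: FeatureField indexes each feature name as a term in a dedicated field and encodes the float feature value into that term's frequency; convenience queries such as newSaturationQuery then score roughly as weight * v / (v + pivot). A rough sketch of that scoring shape in plain Java (the pivot value below is illustrative, not a Lucene default):

```java
// Sketch of the saturation scoring shape used by
// FeatureField.newSaturationQuery: score ~ weight * v / (v + pivot).
// The pivot is a tuning knob; the value used here is only an example.
public class SaturationSketch {

    // Monotonic in the feature value, saturating toward 1.0 as v grows.
    static float saturation(float featureValue, float pivot) {
        return featureValue / (featureValue + pivot);
    }

    public static void main(String[] args) {
        // A feature value of 3.0 with pivot 1.0 scores 3/(3+1) = 0.75.
        System.out.println(saturation(3.0f, 1.0f));
    }
}
```

The useful property is that larger feature values always score higher but with diminishing returns, which tends to behave well when combined with other query clauses.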

On Wed, Nov 11, 2020 at 4:37 PM Mayya Sharipova
<[email protected]> wrote:
>
> For sparse vectors, we found that Lucene's FeatureField could also be useful. 
> It stores features as terms and feature values as term frequencies, and 
> provides several convenient functions to calculate scores based on feature 
> values.
>
> On Fri, Nov 6, 2020 at 11:16 AM Michael McCandless 
> <[email protected]> wrote:
>>
>> Also, be aware that recent Lucene versions enabled compression for 
>> BinaryDocValues fields, which might hurt performance of your second solution.
>>
>> This compression is not yet something you can easily turn off, but there are 
>> ongoing discussions/PRs about how to make it more easily configurable for 
>> applications that care more about search CPU cost than index size for 
>> BinaryDocValues fields: https://issues.apache.org/jira/browse/LUCENE-9378
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Nov 6, 2020 at 10:21 AM Michael McCandless 
>> <[email protected]> wrote:
>>>
>>> In addition to payloads having kind of high-ish overhead (they slow down 
>>> indexing, do not compress very well I think, and slow down search since you 
>>> must pull positions), they are also sort of a forced fit for your use case, 
>>> right?  Because a payload in Lucene is per-term-position, whereas you 
>>> really need this vector per term (irrespective of the positions where that 
>>> term occurs in each document)?
>>>
>>> Your second solution is an intriguing one.  So you would use Lucene's 
>>> custom term frequencies to store indices into that per-document map encoded 
>>> into a BinaryDocValues field?  During indexing I guess you would need a 
>>> TokenFilter that hands out these indices in order (0, 1, 2, ...) based on 
>>> the unique terms it sees, and after all tokens are done, exports a 
>>> byte[] serialized map?  Hmm, except term frequency 0 is not allowed, so 
>>> you'd need to add 1 to all indices.
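Stripped of the Lucene TokenFilter plumbing, the index-handout step above reduces to assigning each unique term a first-seen index and shifting it by one, since a term frequency of 0 is not allowed. A minimal sketch of just that bookkeeping (class and method names are mine, not Lucene's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Assigns each unique term a stable index in first-seen order (0, 1, 2, ...)
// and exposes index + 1 as the value to store via custom term frequencies,
// because Lucene forbids a term frequency of 0.
public class TermIndexAssigner {
    private final Map<String, Integer> indices = new LinkedHashMap<>();

    // Returns the value to write as the term's frequency: its index + 1.
    // Repeated terms get the same value back.
    public int frequencyFor(String term) {
        return indices.computeIfAbsent(term, t -> indices.size()) + 1;
    }

    // First-seen order of unique terms; this is what would drive the
    // layout of the serialized per-document map.
    public Map<String, Integer> snapshot() {
        return indices;
    }
}
```

At query time the lookup would subtract 1 from the term frequency to recover the index into the serialized map.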
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Mon, Oct 26, 2020 at 6:16 AM Bruno Roustant <[email protected]> 
>>> wrote:
>>>>
>>>> Hi Ankur,
>>>> Indeed, payloads are the standard way to solve this problem. For light 
>>>> queries with a few top-N results, that should be efficient. For 
>>>> multi-term queries, it could become costly if you need to access the 
>>>> payloads of too many terms.
>>>> Also, there is an experimental PostingsFormat called 
>>>> SharedTermsUniformSplit (class named STUniformSplitPostingsFormat) that 
>>>> would allow you to effectively share the overlapping terms in the index 
>>>> while having 50 fields. This would solve the index bloat issue, but would 
>>>> not fully solve the seeks issue. You might want to benchmark this approach 
>>>> too.
>>>>
>>>> Bruno
>>>>
>>>> Le ven. 23 oct. 2020 à 02:48, Ankur Goel <[email protected]> a écrit :
>>>>>
>>>>> Hi Lucene Devs,
>>>>> I have a need to store a sparse feature vector on a per-term basis. 
>>>>> The total number of possible dimensions is small (~50) and known at 
>>>>> indexing time. The feature values will be used in scoring along with 
>>>>> corpus statistics. It looks like payloads were created for exactly this 
>>>>> purpose, but some workaround is needed to minimize the performance 
>>>>> penalty, as mentioned on the wiki.
>>>>>
>>>>> An alternative is to override the term frequency to be a pointer into a 
>>>>> Map<pointer, Feature_Vector> serialized and stored in BinaryDocValues. 
>>>>> At query time, the matching docId will be used to advance to the 
>>>>> starting offset of this map, and the term frequency will be used to 
>>>>> look up the Feature_Vector in the serialized map. That's my current 
>>>>> plan, but I haven't benchmarked it.
>>>>>
>>>>> The problem I am trying to solve is to reduce index bloat and eliminate 
>>>>> unnecessary seeks: currently these ~50 dimensions are stored as separate 
>>>>> fields in the index with very high term overlap, and Lucene does not 
>>>>> share the terms dictionary across different fields. That itself could be 
>>>>> a new feature for Lucene, but I imagine it would require lots of work.
>>>>>
>>>>> Any ideas are welcome :-)
>>>>>
>>>>> Thanks
>>>>> -Ankur

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
