[
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400194#comment-17400194
]
Adrien Grand commented on LUCENE-10048:
---------------------------------------
I'd be interested in better understanding the downsides, if any, of using
something like bfloat16. This is similar to what FeatureField is doing (one
minor difference is that FeatureField cheats a bit by reusing the sign bit,
since feature values must be positive, in order to have one more mantissa
bit). There is indeed a loss of precision, but does it actually hurt
relevance? To give perspective, this still retains more precision than the way
we encode normalization factors in a single byte.
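To make the comparison concrete, here is a minimal sketch of the two 16-bit
encodings being contrasted. It is an illustration under stated assumptions, not
Lucene code: the class and method names are hypothetical, and the positive-only
variant is only modeled on the idea described above (dropping the sign bit to
keep one more mantissa bit). Both variants truncate rather than round.

```java
public final class Bfloat16Sketch {

    // bfloat16-style: keep the sign bit, the 8 exponent bits and the top
    // 7 mantissa bits of an IEEE 754 float (truncation, no rounding).
    static int encodeBfloat16(float value) {
        return Float.floatToIntBits(value) >>> 16;
    }

    static float decodeBfloat16(int bits) {
        return Float.intBitsToFloat(bits << 16);
    }

    // Positive-only variant (hypothetical helper): the sign bit is known to
    // be 0, so shifting by 15 keeps 8 exponent bits and 8 mantissa bits
    // while still fitting in 16 bits.
    static int encodePositive(float value) {
        if (!(value > 0f) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("value must be positive and finite");
        }
        return Float.floatToIntBits(value) >>> 15;
    }

    static float decodePositive(int bits) {
        return Float.intBitsToFloat(bits << 15);
    }

    public static void main(String[] args) {
        float v = 3.14159f;
        System.out.println("bfloat16 round trip:      " + decodeBfloat16(encodeBfloat16(v)));
        System.out.println("positive-only round trip: " + decodePositive(encodePositive(v)));
    }
}
```

With truncation, the relative error stays below 2^-7 for the bfloat16-style
variant and below 2^-8 for the positive-only variant.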
One benefit of using term frequencies compared to payloads is that they can be
used for dynamic pruning with block-max WAND (BMW), which might be an important
feature if these term-doc scores are used for scoring.
> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
> Key: LUCENE-10048
> URL: https://issues.apache.org/jira/browse/LUCENE-10048
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Tony Xu
> Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a
> check on the per-field total token count (i.e. field length) to ensure we
> don't index too many tokens. This is done by accumulating each token's
> *TermFrequencyAttribute*.
>
> Lucene currently allows a custom term frequency to be attached to each
> token, and the usage of the frequency can be pretty wild. It is therefore
> possible for the check to fail with only a few tokens that have large
> frequencies, as in the following case. Currently Lucene will skip indexing
> the whole document.
> *"foo|<very large number> bar|<very large number>"*
>
> What should be the way to inform the indexing chain not to check the field
> length? A related observation: when a custom term frequency is in use, the
> user is not likely to use the similarity for this field. Maybe we can offer a
> way to specify that, too?
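
For readers unfamiliar with the check described in the issue, the failure mode
can be sketched as follows. This is a simplified, hypothetical model (the class
name, method name, and error message are assumptions, and Lucene's actual
indexing chain may differ in detail): the per-field length is accumulated from
each token's term frequency, and integer overflow is treated as an error.

```java
public final class FieldLengthCheckSketch {

    // Sums per-token term frequencies into an int field length and rejects
    // the field if the sum overflows (hypothetical stand-in for the check).
    static int totalLength(int[] termFreqs) {
        int length = 0;
        for (int freq : termFreqs) {
            try {
                length = Math.addExact(length, freq);
            } catch (ArithmeticException e) {
                throw new IllegalArgumentException("too many tokens for field", e);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // Ordinary frequencies pass the check.
        System.out.println(totalLength(new int[] {1, 2, 3}));
        // Two tokens with very large custom frequencies are enough to fail it.
        try {
            totalLength(new int[] {Integer.MAX_VALUE, Integer.MAX_VALUE});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```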