Can you require the user to specify missing: true or missing: false
semantics? With that, you can decide what to do with the missing values.
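
To make that concrete, here is a rough sketch under my own assumptions, not
anything that exists in Lucene. It is plain Java that uses java.util.BitSet
in place of real postings, and the names SparseBooleanSegment, flush,
matching and missingAs are all made up for illustration. The user-declared
missing default is applied first, then only the less common value is kept
per segment, and a query for the more common value becomes NOT the stored
one.

```java
import java.util.BitSet;

/** Toy model of one segment that stores doc IDs only for the rarer Boolean value. */
final class SparseBooleanSegment {
    private final int maxDoc;
    private final boolean indexedValue; // the less common value in this segment
    private final BitSet indexedDocs;   // doc IDs that hold indexedValue

    private SparseBooleanSegment(int maxDoc, boolean indexedValue, BitSet indexedDocs) {
        this.maxDoc = maxDoc;
        this.indexedValue = indexedValue;
        this.indexedDocs = indexedDocs;
    }

    /**
     * Hypothetical flush step. values[doc] is true, false, or null (missing);
     * missingAs is the user-declared "missing: true" / "missing: false" default.
     */
    static SparseBooleanSegment flush(Boolean[] values, boolean missingAs) {
        BitSet trueDocs = new BitSet(values.length);
        for (int doc = 0; doc < values.length; doc++) {
            boolean value = (values[doc] == null) ? missingAs : values[doc];
            if (value) {
                trueDocs.set(doc);
            }
        }
        // Keep "true" if it is the rarer value (ties also keep "true").
        if (trueDocs.cardinality() * 2 <= values.length) {
            return new SparseBooleanSegment(values.length, true, trueDocs);
        }
        // Otherwise keep the complement, i.e. the docs whose value is "false".
        BitSet falseDocs = (BitSet) trueDocs.clone();
        falseDocs.flip(0, values.length);
        return new SparseBooleanSegment(values.length, false, falseDocs);
    }

    /** Docs matching field:wanted; the common value is answered as NOT the stored one. */
    BitSet matching(boolean wanted) {
        BitSet result = (BitSet) indexedDocs.clone();
        if (wanted != indexedValue) {
            result.flip(0, maxDoc);
        }
        return result;
    }

    boolean indexedValue() {
        return indexedValue;
    }

    int indexedCount() {
        return indexedDocs.cardinality();
    }
}
```

With values {true, true, null, true, false, true} and missing: false, the
segment keeps only the two "false" docs (2 and 4), and a query for true is
answered as their complement. Switching to missing: true moves doc 2 to the
true side, so only doc 4 is stored.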

On Thu, Nov 9, 2023, 7:55 AM Mikhail Khludnev <m...@apache.org> wrote:

> Hello Michael.
> This optimization ("NOT the less common value") assumes that the Boolean
> field is required, but how would you enforce that mandatory-field
> constraint in Lucene? I'm not aware of anything like a Solr schema or
> mapping.
> If, say, foo:true is common, its posting list is a dense run of
> sequentially increasing doc IDs: 1, 2, 3, 4, 5, ... Might that already be
> compressed by the codecs, with something like
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
> ?
>
> On Thu, Nov 9, 2023 at 3:31 AM Michael Froh <msf...@gmail.com> wrote:
>
>> Hey,
>>
>> I've been musing about ideas for a "clever" Boolean field type in Lucene
>> for a while, and I think I might have an idea that could work. That said,
>> this popped into my head this afternoon and is not fully baked. It
>> may not be very clever at all.
>>
>> My experience is that Boolean fields tend to be overwhelmingly true or
>> overwhelmingly false. I've had pretty good luck with using a keyword-style
>> field, where the only indexed term represents the sparser value. (For example,
>> I did a thing years ago with explicit tombstones, where versioned deletes
>> would have the field "deleted" with a value of "true", and live
>> documents didn't have the deleted field at all. Every query would add a
>> filter on "NOT deleted:true".)
>>
>> That's great when you know up-front what the sparse value is going to be.
>> Working on OpenSearch, I just created an issue suggesting that we take a
>> hint from users for which value they think is going to be more common so we
>> only index the less common one:
>> https://github.com/opensearch-project/OpenSearch/issues/11143
>>
>> At the Lucene level, though, we could index a Boolean field type as the
>> less common term when we flush (by counting the values and figuring out
>> which is less common). Then, per segment, we can rewrite any query for the
>> more common value as NOT the less common value.
>>
>> You can compute upper/lower bounds on the value frequencies cheaply
>> during a merge, so I think you could usually write the doc IDs for the less
>> common value directly (without needing to count them first), even when
>> input segments disagree on which is the more common value.
>>
>> If your Boolean field is not overwhelmingly lopsided, you might even want
>> to split segments to be 100% true or 100% false, such that queries against
>> the Boolean field become match-all or match-none. On a retail website,
>> maybe you have some toggle for "only show me results with property X" -- if
>> all your property X products are in one segment or a handful of segments,
>> you can drop the property X clause from the matching segments and skip the
>> other segments.
>>
>> I guess one icky part of this compared to the usual Lucene field model is
>> that I'm assuming a Boolean field is never missing (or I guess missing
>> implies "false" by default?). Would that be a deal-breaker?
>>
>> Thanks,
>> Froh
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
