Can you require the user to specify missing: true or missing: false semantics? With that, you can decide what to do with the missing values.
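
One way to read that suggestion, as a minimal sketch: the field declares up front whether an absent value counts as true or false, and every document is resolved through that setting before it is indexed. The names `resolve` and `missing` are hypothetical and are not part of any Lucene API.

```python
from typing import Optional


def resolve(value: Optional[bool], missing: bool) -> bool:
    """Return the effective Boolean for one document.

    `missing` is the hypothetical per-field setting: the value that an
    absent field is treated as.
    """
    return missing if value is None else value


# Three documents: explicit true, explicit false, and no value at all.
docs = [True, False, None]

as_false = [resolve(v, missing=False) for v in docs]
as_true = [resolve(v, missing=True) for v in docs]
```

With the setting fixed per field, the third document lands on a definite side in both cases, so a later "NOT the less common value" rewrite has nothing left to guess.
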
On Thu, Nov 9, 2023, 7:55 AM Mikhail Khludnev <m...@apache.org> wrote:

> Hello Michael.
> This optimization "NOT the less common value" assumes that boolean field
> is required, but how to enforce this mandatory field constraint in Lucene?
> I'm not aware of something like Solr schema or mapping.
> If saying foo:true is common, it means that the posting list goes like
> dense sequentially increasing numbers 1,2,3,4,5.. May it already be
> compressed by codecs like
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
> ?
>
> On Thu, Nov 9, 2023 at 3:31 AM Michael Froh <msf...@gmail.com> wrote:
>
>> Hey,
>>
>> I've been musing about ideas for a "clever" Boolean field type on Lucene
>> for a while, and I think I might have an idea that could work. That said,
>> this popped into my head this afternoon and has not been fully-baked. It
>> may not be very clever at all.
>>
>> My experience is that Boolean fields tend to be overwhelmingly true or
>> overwhelmingly false. I've had pretty good luck with using a keyword-style
>> field, where the only term represents the more sparse value. (For example,
>> I did a thing years ago with explicit tombstones, where versioned deletes
>> would have the field "deleted" with a value of "true", and live
>> documents didn't have the deleted field at all. Every query would add a
>> filter on "NOT deleted:true".)
>>
>> That's great when you know up-front what the sparse value is going to be.
>> Working on OpenSearch, I just created an issue suggesting that we take a
>> hint from users for which value they think is going to be more common so we
>> only index the less common one:
>> https://github.com/opensearch-project/OpenSearch/issues/11143
>>
>> At the Lucene level, though, we could index a Boolean field type as the
>> less common term when we flush (by counting the values and figuring out
>> which is less common). Then, per segment, we can rewrite any query for the
>> more common value as NOT the less common value.
>>
>> You can compute upper/lower bounds on the value frequencies cheaply
>> during a merge, so I think you could usually write the doc IDs for the less
>> common value directly (without needing to count them first), even when
>> input segments disagree on which is the more common value.
>>
>> If your Boolean field is not overwhelmingly lopsided, you might even want
>> to split segments to be 100% true or 100% false, such that queries against
>> the Boolean field become match-all or match-none. On a retail website,
>> maybe you have some toggle for "only show me results with property X" -- if
>> all your property X products are in one segment or a handful of segments,
>> you can drop the property X clause from the matching segments and skip the
>> other segments.
>>
>> I guess one icky part of this compared to the usual Lucene field model is
>> that I'm assuming a Boolean field is never missing (or I guess missing
>> implies "false" by default?). Would that be a deal-breaker?
>>
>> Thanks,
>> Froh
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
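
The per-segment idea in the quoted message can be modelled in a few lines. This is a toy sketch in Python, not Lucene code: a segment is a plain list of Booleans, `flush_segment` and `matching_docs` are made-up names, and absent values are assumed to have been resolved to true or false before the flush.

```python
from typing import List, Set, Tuple


def flush_segment(values: List[bool]) -> Tuple[bool, Set[int], int]:
    """Count the values and keep postings only for the less common one.

    Returns (sparse_value, doc IDs holding sparse_value, max_doc).
    """
    true_count = sum(values)
    # Ties index "true"; either choice is correct, only the size differs.
    sparse_value = true_count <= len(values) - true_count
    postings = {doc for doc, v in enumerate(values) if v == sparse_value}
    return sparse_value, postings, len(values)


def matching_docs(segment: Tuple[bool, Set[int], int], wanted: bool) -> Set[int]:
    """Answer field:wanted, rewriting to NOT sparse_value when needed."""
    sparse_value, postings, max_doc = segment
    if wanted == sparse_value:
        return set(postings)
    # The more common value is every doc in the segment minus the postings.
    return set(range(max_doc)) - postings


# Mostly-true segment: only the two "false" docs are stored.
segment = flush_segment([True, True, False, True, True, False, True])
sparse_value, postings, max_doc = segment
trues = matching_docs(segment, True)
falses = matching_docs(segment, False)
```

The complement step is where a document with no value would silently count as the common value, which is why the reply at the top of the thread asks for explicit missing semantics.
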