Scorer#getMinScore()

2023-06-09 Thread Marc D'Mello
Hi all,

I was wondering why there is no Scorer#getMinScore() equivalent to
Scorer#getMaxScore().
I think it could potentially be useful for skipping when you have a scoring
function with a subtraction in it.

As a contrived example, say I wrote a SubtractionAndQuery(Query a, Query b)
that matched a conjunction of a and b but the score was a.score() -
b.score(). When creating a scorer, the best getMaxScore() function I could
create would look like this:

float getMaxScore(int upto) {
  return a.getMaxScore(upto);
}

However, this would not give me the tightest upper bound score possible as
I am completely neglecting the "b" term here. Something like this would be
better:

float getMaxScore(int upto) {
  return Math.max(a.getMaxScore(upto) - b.getMinScore(upto), 0);
}

So I was wondering whether leaving out this API was by design (for the same
reason Lucene doesn't allow negative scores for queries) or whether the
additional block-level metadata required to store the min term scores would
be too much. I'm sure there are other issues I could be overlooking as
well.
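
To make the idea concrete, here is a minimal, self-contained sketch of the bound I have in mind. It does not use Lucene classes: ScoreBounds, FixedBounds, and SubtractionBounds are hypothetical names, and getMinScore() is the proposed method, not something Lucene exposes today. It also assumes the combined score is clamped at zero.

```java
// Hypothetical stand-in for per-clause score bounds. getMinScore is the
// proposed method from this thread and is NOT part of Lucene's Scorer API.
interface ScoreBounds {
  float getMaxScore(int upTo);

  float getMinScore(int upTo);
}

// Toy clause whose bounds are constant, only used to exercise the sketch.
final class FixedBounds implements ScoreBounds {
  private final float min;
  private final float max;

  FixedBounds(float min, float max) {
    this.min = min;
    this.max = max;
  }

  @Override
  public float getMaxScore(int upTo) {
    return max;
  }

  @Override
  public float getMinScore(int upTo) {
    return min;
  }
}

// Bounds for a score defined as max(a.score() - b.score(), 0).
final class SubtractionBounds implements ScoreBounds {
  private final ScoreBounds a;
  private final ScoreBounds b;

  SubtractionBounds(ScoreBounds a, ScoreBounds b) {
    this.a = a;
    this.b = b;
  }

  @Override
  public float getMaxScore(int upTo) {
    // The largest possible difference pairs the largest a with the smallest b.
    return Math.max(a.getMaxScore(upTo) - b.getMinScore(upTo), 0f);
  }

  @Override
  public float getMinScore(int upTo) {
    // The smallest possible difference pairs the smallest a with the largest b.
    return Math.max(a.getMinScore(upTo) - b.getMaxScore(upTo), 0f);
  }
}
```

With a bounded to [1, 5] and b bounded to [2, 3], the a-only bound is 5 while the bound that uses b's minimum is 3, which is the tightening I am after.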

Any answers would be greatly appreciated!

Thanks,
Marc


Blunders profiling for Searching is broken

2023-03-17 Thread Marc D'Mello
Hi all,

I was looking at some of the profiles on Blunders (which is linked from the
nightly benchmarking site: https://home.apache.org/~mikemccand/lucenebench/)
and it seems like some of the latest Searching profiles are not working.
For example:
https://blunders.io/jfr-demo/searching-2023.03.16.18.02.48/jvm_info. The
indexing profiles seem to be working fine as far as I can tell, so I wonder
if this is a problem with how the nightly benchmarks are publishing data to
the Blunders API.

Thanks,
Marc


Re: Request for naming help

2022-12-29 Thread Marc D'Mello
Hi Greg,

I'm also OK merging as is since this is a new feature and doesn't affect
any of the current functionality. I also think there are no glaring issues
with the API in its current state. However, I do think that merging the
range and rangeonrange functionality makes sense and I like Adrien's
suggestion of providing factory methods. I think if we merge in its current
state we should create a new issue to refactor the range and
rangeonrange faceting package into one and follow the RangeFieldQuery model
more closely.

On Thu, Dec 29, 2022 at 2:58 PM Greg Miller  wrote:

> Hey Marc-
>
> I don't want to speak for Adrien as he might have something different in
> mind, but I think that's more-or-less the idea. I'm not sure the factory
> methods belong on the LongRange/DoubleRange classes, or if separate classes
> should be created for this purpose (which is more how I thought of it)?
>
> To do this cleanly though, I'd really like us to try to consolidate all
> the "range related" faceting functionality into one java package and
> consolidate the API a bit. As part of this, I think we can be a little
> smarter about not duplicating the "range" classes themselves.
>
> All this said, given that I think your "range on range" faceting PR is
> ready to be merged as it currently exists, and has been through a number of
> iterations already, I'm OK if we want to merge that work as it stands and
> follow up with revisiting the API/naming/etc. as a future project. What do
> you think?
>
> Cheers,
> -Greg
>
> On Tue, Dec 13, 2022 at 7:23 PM Marc D'Mello  wrote:
>
>> Hi,
>>
>> I'm a bit unsure about what is being suggested. Is the idea to rename
>> range#LongRange and rangeonrange#LongRange to LongFieldFacets and
>> LongRangeFacets respectively and stick the static getters in there? In that
>> case, I also think that the idea makes a lot of sense and that it would
>> match our current range query API much better.
>>
>> In addition, looking at document#LongRange, there are queries like
>> newContainsQuery() and newWithinQuery() that we can probably mimic to
>> avoid exposing RangeFieldQuery.QueryType to the user.
>>
>> On Tue, Dec 13, 2022 at 5:04 PM Greg Miller  wrote:
>>
>>> Thanks for the suggestion Adrien. I like this idea! Marc- what do you
>>> think?
>>>
>>> We might need to rework the package structure under the facets module to
>>> make this clean, but that might not be a terrible thing anyway. The
>>> existing sub-packages will make it challenging to get the visibility right.
>>> I think it would be ideal to flatten the package so we can reduce
>>> visibility of the class definitions and only expose the factory methods.
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Tue, Dec 13, 2022 at 01:18 Adrien Grand  wrote:
>>>
>>>> I wonder if the facets actually require a different name, since they
>>>> look to me like a generalization of range facets for range fields,
>>>> while we previously only supported range facets on numeric fields. We
>>>> could keep calling them range facets?
>>>>
>>>> Maybe we could use the same model we used for queries by not exposing
>>>> query classes to users and providing factory methods, e.g. we could
>>>> have something like:
>>>>
>>>> public class LongFieldFacets {
>>>>
>>>>   public static Facets getRangeFacetCounts(
>>>>       String field, FacetsCollector hits, LongRange... ranges) {
>>>>     return new LongRangeFacetCounts(...);
>>>>   }
>>>>
>>>> }
>>>>
>>>> public class LongRangeFacets {
>>>>
>>>>   // same function name
>>>>   public static Facets getRangeFacetCounts(
>>>>       String field,
>>>>       FacetsCollector hits,
>>>>       RangeFieldQuery.QueryType queryType,
>>>>       LongRange... ranges) {
>>>>     return new LongRangeOnRangeFacetCounts(...);
>>>>   }
>>>>
>>>> }
>>>>
>>>> We'd still need to give a name for these classes, but the name would
>>>> be less important since these class names would be only for ourselves.
>>>> Users would never see them and refer to this new functionality as
>>>> range facets on range fields?
>>>>
>>>> On Mon, Dec 12, 2022 at 10:11 PM Gus Heck  wrote:
>>>> >
>>>> > In that case, maybe "Range Logic Faceting" ?
>>>> >
>>>> > Relation seems too broad a

Re: Request for naming help

2022-12-13 Thread Marc D'Mello
Hi,

I'm a bit unsure about what is being suggested. Is the idea to rename
range#LongRange and rangeonrange#LongRange to LongFieldFacets and
LongRangeFacets respectively and stick the static getters in there? In that
case, I also think that the idea makes a lot of sense and that it would
match our current range query API much better.

In addition, looking at document#LongRange, there are queries like
newContainsQuery() and newWithinQuery() that we can probably mimic to
avoid exposing RangeFieldQuery.QueryType to the user.
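
As a rough illustration of what I mean by mimicking those factory methods, here is a self-contained sketch that keeps the relation enum private and exposes one static method per relation. Nothing here is Lucene API: LongRangeSketch, RangeFacetsSketch, and the method names are made up for illustration, and the counting is a naive loop over in-memory ranges rather than doc values. The contains/within semantics follow my understanding of the existing queries (the indexed range contains, or is within, the query range).

```java
// Hypothetical closed range [min, max]; not Lucene's LongRange.
final class LongRangeSketch {
  final long min;
  final long max;

  LongRangeSketch(long min, long max) {
    if (min > max) {
      throw new IllegalArgumentException("min must be <= max");
    }
    this.min = min;
    this.max = max;
  }
}

// Hypothetical facade: callers pick a relation by method name, so the
// relation enum never appears in the public surface.
final class RangeFacetsSketch {
  private enum Relation {
    INTERSECTS,
    CONTAINS,
    WITHIN
  }

  private RangeFacetsSketch() {}

  static int[] intersectsCounts(LongRangeSketch[] docs, LongRangeSketch... queries) {
    return count(Relation.INTERSECTS, docs, queries);
  }

  static int[] containsCounts(LongRangeSketch[] docs, LongRangeSketch... queries) {
    return count(Relation.CONTAINS, docs, queries);
  }

  static int[] withinCounts(LongRangeSketch[] docs, LongRangeSketch... queries) {
    return count(Relation.WITHIN, docs, queries);
  }

  // Returns one count per query range: the number of doc ranges that match it.
  private static int[] count(
      Relation relation, LongRangeSketch[] docs, LongRangeSketch[] queries) {
    int[] counts = new int[queries.length];
    for (int i = 0; i < queries.length; i++) {
      for (LongRangeSketch doc : docs) {
        if (matches(relation, doc, queries[i])) {
          counts[i]++;
        }
      }
    }
    return counts;
  }

  private static boolean matches(Relation relation, LongRangeSketch doc, LongRangeSketch q) {
    switch (relation) {
      case INTERSECTS:
        return doc.min <= q.max && q.min <= doc.max;
      case CONTAINS: // the doc range contains the query range
        return doc.min <= q.min && q.max <= doc.max;
      case WITHIN: // the doc range lies within the query range
        return q.min <= doc.min && doc.max <= q.max;
      default:
        throw new AssertionError(relation);
    }
  }
}
```

The point is only the shape of the API: adding a new relation later would mean adding a method, not widening a public enum.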

On Tue, Dec 13, 2022 at 5:04 PM Greg Miller  wrote:

> Thanks for the suggestion Adrien. I like this idea! Marc- what do you
> think?
>
> We might need to rework the package structure under the facets module to
> make this clean, but that might not be a terrible thing anyway. The
> existing sub-packages will make it challenging to get the visibility right.
> I think it would be ideal to flatten the package so we can reduce
> visibility of the class definitions and only expose the factory methods.
>
> Cheers,
> -Greg
>
> On Tue, Dec 13, 2022 at 01:18 Adrien Grand  wrote:
>
>> I wonder if the facets actually require a different name, since they
>> look to me like a generalization of range facets for range fields,
>> while we previously only supported range facets on numeric fields. We
>> could keep calling them range facets?
>>
>> Maybe we could use the same model we used for queries by not exposing
>> query classes to users and providing factory methods, e.g. we could
>> have something like:
>>
>> public class LongFieldFacets {
>>
>>   public static Facets getRangeFacetCounts(
>>       String field, FacetsCollector hits, LongRange... ranges) {
>>     return new LongRangeFacetCounts(...);
>>   }
>>
>> }
>>
>> public class LongRangeFacets {
>>
>>   // same function name
>>   public static Facets getRangeFacetCounts(
>>       String field,
>>       FacetsCollector hits,
>>       RangeFieldQuery.QueryType queryType,
>>       LongRange... ranges) {
>>     return new LongRangeOnRangeFacetCounts(...);
>>   }
>>
>> }
>>
>> We'd still need to give a name for these classes, but the name would
>> be less important since these class names would be only for ourselves.
>> Users would never see them and refer to this new functionality as
>> range facets on range fields?
>>
>> On Mon, Dec 12, 2022 at 10:11 PM Gus Heck  wrote:
>> >
>> > In that case, maybe "Range Logic Faceting" ?
>> >
>> > Relation seems too broad and too overloaded elsewhere, makes me think
>> of RDBMS, related-ness, joins and such via word associations.
>> >
>> > On Mon, Dec 12, 2022 at 3:27 PM Greg Miller  wrote:
>> >>
>> >> Thanks for the suggestion! I like the descriptiveness of it. My only
>> hesitation is that it supports more than range intersection based on the
>> provided QueryType instance (e.g., within, contains). I _imagine_ that
>> intersection will be most common, but I don’t really know of course. I
>> thought about generalizing your suggestion to something like “Range
>> Relation Faceting,” but fear that would be confusing.
>> >>
>> >> Thanks again!
>> >>
>> >> Cheers,
>> >> -Greg
>> >>
>> >> On Mon, Dec 12, 2022 at 10:19 Gus Heck  wrote:
>> >>>
>> >>> Maybe "Range Intersect Faceting"?
>> >>>
>> >>> On Mon, Dec 12, 2022 at 1:11 PM Greg Miller wrote:
>> >>>>
>> >>>> Folks-
>> >>>>
>> >>>> Naming is hard! (But you all know that already).
>> >>>>
>> >>>> Marc D'Mello and I have been working on a new faceting
>> implementation that's meant to complement Lucene's existing range-relation
>> queries (e.g., LongRange#newIntersectsQuery, DoubleRange#newContainsQuery,
>> LongRangeDocValuesField#newSlowIntersectsQuery, etc.). Well, I should say
>> Marc is working on the change and I'm just providing nit-picky feedback on
>> his PR, which is here: https://github.com/apache/lucene/pull/11901. The
>> general idea of this feature is to allow users to get facet counts for
>> these sorts of range-relation filters before they're applied. For example,
>> if a user is indexing ranges with their documents, they may have a set of
>> query-ranges they want to facet on, based on some range relationship (e.g.,
>> intersection, contains, etc.).
>> >>>>
>> >>>> As a concrete example, imagine that documents contain a p

Re: Adding a new PointDocValuesField

2022-05-25 Thread Marc D'Mello
Read your example again and yes, that makes sense. I was only thinking in
terms of single dimensions, my bad!
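
For anyone else who reads the thread the way I first did, here is a small self-contained sketch of the wiper blade example quoted below. It is plain Java rather than Lucene code, and Fitment and WiperBladeDoc are hypothetical names. It shows how ANDing two separate multi-valued fields reports a match for a combination that was never indexed.

```java
// Hypothetical (year, make) combination that a part fits.
final class Fitment {
  final int year;
  final String make;

  Fitment(int year, String make) {
    this.year = year;
    this.make = make;
  }
}

// Toy document: wiper blades that fit 2010 Ford and 2011 Chevy vehicles only.
final class WiperBladeDoc {
  // Separate multi-valued fields: the pairing between values is lost.
  private final int[] years = {2010, 2011};
  private final String[] makes = {"Ford", "Chevy"};

  // Combined field: each entry is one valid 2-dimensional point.
  private final Fitment[] fitments = {new Fitment(2010, "Ford"), new Fitment(2011, "Chevy")};

  // ANDs two independent per-field checks.
  boolean matchesSeparateFields(int year, String make) {
    boolean yearMatches = false;
    for (int y : years) {
      if (y == year) {
        yearMatches = true;
      }
    }
    boolean makeMatches = false;
    for (String m : makes) {
      if (m.equals(make)) {
        makeMatches = true;
      }
    }
    return yearMatches && makeMatches;
  }

  // Checks both dimensions against the same indexed point.
  boolean matchesCombined(int year, String make) {
    for (Fitment f : fitments) {
      if (f.year == year && f.make.equals(make)) {
        return true;
      }
    }
    return false;
  }
}
```

Asking for 2010 Chevy matches under the separate-field check but not under the combined one, which is exactly the information that has to be kept per point.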

On Wed, May 25, 2022 at 11:08 AM Greg Miller  wrote:

> I appreciate all the feedback, but disagree that we can accomplish what
> we’re trying to do here with the existing fields.
>
> It’s not sufficient to AND together multiple fields for this use-case
> because of the fact that the different dimensions can be multi-valued and
> not all combinations are valid. To go back to my example, imagine wiper
> blades that fit 2010 Ford vehicles and 2011 Chevy vehicles but not 2010
> Chevy or 2011 Ford. You have to index the combinations, not the separate
> component values. I can’t see a way to retain this information with
> separate fields. Am I missing something? I guess with an “unsorted” numeric
> DV type we could get there with aligned indices, as you describe, but that
> seems less appealing than supporting multi-dim points directly.
>
> I’m in agreement though that there isn’t a compelling need to add a new
> field type for this. I have no problem building on BDV and putting this in
> the sandbox module to start. Makes sense to me. It sounds like we’d have
> consensus to take that approach and re-evaluate if there are future needs?
> Any objections?
>
> Cheers,
> -g
>
>
> On Wed, May 25, 2022 at 10:05 Marc D'Mello  wrote:
>
>> But adding a new type should be the last resort.
>>
>>
>> I did not realize that was the case, that's good to know. It seems like I
>> should just use BDV (which does make the code change easier/faster so I
>> have no issues with it).
>>
>> As for Patrick's suggestion of using separate numeric fields instead of
>> packing them together, that actually does sound like an interesting idea. I
>> think the biggest issue with it, though, would be implementing a multivalued
>> version of this. As Robert pointed out, we would need an UnsortedNumericDV.
>>
>> Thanks for all the feedback!
>>
>>
>> On Wed, May 25, 2022 at 8:17 AM Robert Muir  wrote:
>>
>>> On Wed, May 25, 2022 at 12:17 AM Greg Miller  wrote:
>>> >
>>> >  A "two separate field approach" would
>>> > consist of indexing year and make separately, and you'd lose the
>>> > information that only certain combinations are valid. Am I overlooking
>>> > something with your suggestion? Maybe there's something we can do with
>>> > Lucene already that solves for this case and I'm just not aware of it?
>>> > That's entirely possible and I'd love to learn more if there is!
>>>
>>> This makes no sense to me. If there are two dimensions, there's no
>>> difference in faceting code calling fieldA.value and fieldB.value,
>>> than calling field.valueA and field.valueB.
>>>
>>> In other words, doesn't make any sense to needlessly "pack dimensions
>>> together" at docvalues level, especially for what should be a
>>> column-stride field. There's really no difference from the app
>>> perspective. Any issues you have here seem to be issues around facet
>>> module and not docvalues...
>>>
>>> >
>>> > As for MultiRangeQuery and the mention of sandbox modules, I think
>>> > that's a bit of a different use-case. MultiRangeQuery lets you filter
>>> > by a disjunction of ranges. The "multi" part doesn't relate to
>>> > "multiple values in a doc" (but it does support that, as do the
>>> > "standard" range queries).
>>> >
>>> > Where I see a gap right now, beyond just faceting, is that we can
>>> > represent N-dim points in the points index and filter on them (using
>>> > the points index), but we have no doc values equivalent. This means,
>>> > 1) we can't facet, and 2) we can't create a "slow" query that does
>>> > post-filtering instead of using the points index (which could be a
>>> > very real advantage in cases with a sparse match set but a dense
>>> > points index). So I like the idea of creating that concept and being
>>> > able to facet and filter on it. Whether-or-not this is a "formal" doc
>>> > values type or sits on top of BDV, I have less of a strong opinion.
>>>
>>> We shouldn't add new docvalues types because of "slow queries", I'm
>>> really against that. The root problem is that points impl can't filter
>>> well (like the inverted index can), and as a hack, docvalues "picks up
&

Re: Adding a new PointDocValuesField

2022-05-25 Thread Marc D'Mello
>
> But adding a new type should be the last resort.


I did not realize that was the case, that's good to know. It seems like I
should just use BDV (which does make the code change easier/faster so I
have no issues with it).

As for Patrick's suggestion of using separate numeric fields instead of
packing them together, that actually does sound like an interesting idea. I
think the biggest issue with it, though, would be implementing a multivalued
version of this. As Robert pointed out, we would need an UnsortedNumericDV.

Thanks for all the feedback!


On Wed, May 25, 2022 at 8:17 AM Robert Muir  wrote:

> On Wed, May 25, 2022 at 12:17 AM Greg Miller  wrote:
> >
> >  A "two separate field approach" would
> > consist of indexing year and make separately, and you'd lose the
> > information that only certain combinations are valid. Am I overlooking
> > something with your suggestion? Maybe there's something we can do with
> > Lucene already that solves for this case and I'm just not aware of it?
> > That's entirely possible and I'd love to learn more if there is!
>
> This makes no sense to me. If there are two dimensions, there's no
> difference in faceting code calling fieldA.value and fieldB.value,
> than calling field.valueA and field.valueB.
>
> In other words, doesn't make any sense to needlessly "pack dimensions
> together" at docvalues level, especially for what should be a
> column-stride field. There's really no difference from the app
> perspective. Any issues you have here seem to be issues around facet
> module and not docvalues...
>
> >
> > As for MultiRangeQuery and the mention of sandbox modules, I think
> > that's a bit of a different use-case. MultiRangeQuery lets you filter
> > by a disjunction of ranges. The "multi" part doesn't relate to
> > "multiple values in a doc" (but it does support that, as do the
> > "standard" range queries).
> >
> > Where I see a gap right now, beyond just faceting, is that we can
> > represent N-dim points in the points index and filter on them (using
> > the points index), but we have no doc values equivalent. This means,
> > 1) we can't facet, and 2) we can't create a "slow" query that does
> > post-filtering instead of using the points index (which could be a
> > very real advantage in cases with a sparse match set but a dense
> > points index). So I like the idea of creating that concept and being
> > able to facet and filter on it. Whether-or-not this is a "formal" doc
> > values type or sits on top of BDV, I have less of a strong opinion.
>
> We shouldn't add new docvalues types because of "slow queries", I'm
> really against that. The root problem is that points impl can't filter
> well (like the inverted index can), and as a hack, docvalues "picks up
> the slack". If it's becoming a major issue, address this with points
> directly?
>
> >
> > And finally... it really should be multi-valued. The points index
> > supports multiple points-per-field within a single document. Seems
> > like a big gap that we wouldn't support that with a doc value field.
> > Because BDV is inherently single-valued, I propose we come up with an
> > encoding scheme that encodes multiple points on top of that "single"
> > BDV entry. This is where building on BDV started to feel a little icky
> > to me and it seemed like it might be a good use-case for actually
> > formalizing a format/encoding, but again, no strong preference. We
> > could certainly do something more quickly on top of BDV and formalize
> > an encoding later if/as necessary.
>
> Doesn't matter that points index supports it. Do the use-cases make
> sense? It's especially stupid that e.g. LatLonDocValueField supports
> multi-values. Really? What kind of quantum documents are in multiple
> locations at the same time?
>
> The sortedset/sortednumeric exist to support use-cases on String and
> int, where user wants to "sort on a multivalued field", which is
> really crazy if you think about it. So they both sort the numbers at
> index-time, so that you can pick a "representative" value
> (min/max/median) in constant time. I think a lot of this existing
> stuff is just brain-damage from the no-sql fads, alternatively we
> could remove this multivalued nonsense and the crazy servers that want
> to follow no-sql fads could index just the "representative value"
> (min/max/median) in a single-valued field.
>
> Sorry, I'm just not seeing a lot of strong use-cases here to justify
> creating a new DV field, which we should really avoid, as it's a hugely
> expensive cost. I would recommend prototyping stuff with
> BinaryDocValues, using the sandbox, etc. See if the features get
> popular and people use them.
>
> If they really "catch on", and we think it's more efficient, then we
> can think about how the stuff could be best encoded/compressed/etc.
> But adding a new type should be the last resort. Adding some
> specialized multi-dimensional type is IMO out of the question. It
> would be a lot less horrible to just use separate DV fi

Re: Adding a new PointDocValuesField

2022-05-24 Thread Marc D'Mello
Hi,

Thanks for the responses! For Patrick's question, right now in faceting we
don't have any good way to AND between two fields. I think the original
hyper rectangle issue has a good example of a use case:
https://issues.apache.org/jira/browse/LUCENE-10274.

As for Robert's point, this feature would also allow us to use
MultiRangeQuery in IndexOrDocValuesQuery, but MultiRangeQuery is itself in
the sandbox module so I'm assuming that's a pretty exotic use case as well.
I personally have no issues using BinaryDocValues for this. I was just
wondering if it would be better to create a dedicated doc values type, but it
seems that is not the case.

Thanks,
Marc

On Tue, May 24, 2022 at 1:27 AM Robert Muir  wrote:

> This seems like a really exotic feature to add a dedicated docvalues field for.
>
> We should let BINARY be the catchall for stuff like this.
>
> On Mon, May 23, 2022 at 10:17 PM Marc D'Mello  wrote:
> >
> > Hi,
> >
> > Some background: I've been working on this PR to add hyper rectangle
> faceting capabilities to Lucene facets and I needed to create a new doc
> values field to support this feature. Initially, I had a field that just
> extended BinaryDocValues, but then a discussion came up about whether to
> add a completely new DocValues field, maybe something like
> PointDocValuesField (and SortedPointDocValuesField as the multivalued
> version) to add first class support for this new field. Here is the link to
> the discussion. I think there are a few benefits to this:
> >
> >    - Formalize how we would store points as doc values rather than just
> packing points into a BinaryDocValues field in a format that could change
> at any time
> >    - NumericDocValues enables us to create a SortedNumericDocValuesRange
> query which can be used with IndexOrDocValuesQuery to make some range
> queries more efficient. Adding this new doc values field would let us do
> the same thing with higher dimensional ranges
> >
> > I'm sure I could be missing some benefits, and I also am not super
> experienced with Lucene so there could be drawbacks I am missing as well
> :). From what I understand though, Lucene doesn't have a lot of DocValues
> fields and there should be some thought put into adding new ones, so I was
> wondering if I could get some feedback about the idea. Thanks!


Adding a new PointDocValuesField

2022-05-23 Thread Marc D'Mello
Hi,

Some background: I've been working on this PR to add hyper rectangle
faceting capabilities to Lucene facets and I needed to create a new doc
values field to support this feature. Initially, I had a field that just
extended BinaryDocValues, but then a discussion came up about whether to add
a completely new DocValues field, maybe something like PointDocValuesField
(and SortedPointDocValuesField as the multivalued version) to add first class
support for this new field. Here is the link to the discussion. I think
there are a few benefits to this:

   - Formalize how we would store points as doc values rather than just
   packing points into a BinaryDocValues field in a format that could change
   at any time
   - NumericDocValues enables us to create a SortedNumericDocValuesRange
   query which can be used with IndexOrDocValuesQuery to make some range
   queries more efficient. Adding this new doc values field would let us do
   the same thing with higher dimensional ranges
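
To show what "packing points into a BinaryDocValues field" could look like in practice, here is a minimal sketch of one possible layout. This is an assumption for illustration, not the format used in the PR: each point is stored as a fixed number of 8-byte big-endian longs, and multiple points are simply concatenated. PackedPointsSketch is a hypothetical name, and a real encoding would likely need a sortable byte representation instead of raw two's complement.

```java
// Hypothetical fixed-width layout: points are concatenated, and each
// dimension of each point takes Long.BYTES bytes in big-endian order.
final class PackedPointsSketch {
  private PackedPointsSketch() {}

  // Packs every point (each with exactly `dims` values) into one byte array.
  static byte[] pack(int dims, long[]... points) {
    if (dims <= 0) {
      throw new IllegalArgumentException("dims must be positive");
    }
    byte[] out = new byte[points.length * dims * Long.BYTES];
    int offset = 0;
    for (long[] point : points) {
      if (point.length != dims) {
        throw new IllegalArgumentException("expected " + dims + " dimensions");
      }
      for (long value : point) {
        // Write the most significant byte first.
        for (int shift = 56; shift >= 0; shift -= 8) {
          out[offset++] = (byte) (value >>> shift);
        }
      }
    }
    return out;
  }

  // Reverses pack(): the number of points is derived from the array length.
  static long[][] unpack(int dims, byte[] packed) {
    if (dims <= 0 || packed.length % (dims * Long.BYTES) != 0) {
      throw new IllegalArgumentException("length does not match dims");
    }
    long[][] points = new long[packed.length / (dims * Long.BYTES)][dims];
    int offset = 0;
    for (long[] point : points) {
      for (int d = 0; d < dims; d++) {
        long value = 0;
        for (int i = 0; i < Long.BYTES; i++) {
          value = (value << 8) | (packed[offset++] & 0xFFL);
        }
        point[d] = value;
      }
    }
    return points;
  }
}
```

Because a layout like this lives only in the code that reads and writes the field, it can drift between versions, which is the concern behind the first bullet.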

I'm sure I could be missing some benefits, and I also am not super
experienced with Lucene so there could be drawbacks I am missing as well
:). From what I understand though, Lucene doesn't have a lot of DocValues
fields and there should be some thought put into adding new ones, so I was
wondering if I could get some feedback about the idea. Thanks!