Re: Two-phase range queries?

2021-06-29 Thread David Smiley
Greg, you may find it interesting to see some code in spatial-extras that
works similar to what you describe. There is a so-called
CompositeSpatialStrategy built off a grid using the terms index (not BKD)
plus another that stores the vector geometry into DocValues.  The Query
that does the work is here:
https://github.com/apache/lucene/blob/main/lucene/spatial-extras/src/java/org/apache/lucene/spatial/composite/IntersectsRPTVerifyQuery.java
Only some "shapes" that are at the border of the query shape will require
the potentially expensive DocValues lookup.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Jun 29, 2021 at 9:30 AM Greg Miller  wrote:

> Hi folks-
>
> I've been spending a little time getting familiar with the BKD-tree-based
> range query support currently implemented in Lucene, and wonder if there's
> ever been a discussion around supporting two-phase iteration in this space.
> If I'm understanding the current implementation properly (specifically
> looking at PointRangeQuery), it appears that all matches are determined
> up-front by 1) identifying segments of the tree that contain candidate
> matches (i.e., containing part of the query range), and then 2) confirming
> whether-or-not the contained points actually fall within the range. I'm
> also a little low on coffee this morning so it's entirely possible I'm
> misunderstanding the current implementation (please correct me if so).
>
> With this approach, it seems like we could potentially be doing quite a
> bit of wasted effort in some situations. I have no thoughts on how to
> actually implement this yet, but I wonder if we could support two-phase
> iteration by 1) returning all docs with points contained in candidate
> BKD-tree segments as an approximation, and then 2) only checking the points
> against the query range when confirming matches in the second phase? I
> think the idea would extend to LatLonPointDistanceQuery as well (and maybe
> others?).
>
> I did a Jira search for a related issue but came up empty. Anyone know if
> this idea has been discussed previously, or if there's some inherent flaw
> with the approach that would make it a non-starter? I don't really have any
> cycles to work on this at the moment, but can at least open a Jira issue to
> track if it seems like a reasonable thing to explore.
>
> Cheers,
> -Greg
>


Re: 9.0 release

2021-06-29 Thread David Smiley
There are also deprecations to remove:
https://issues.apache.org/jira/browse/LUCENE-8638

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Jun 29, 2021 at 2:43 PM Mike Drob  wrote:

> Looks like just LUCENE-9334 remains?
>
> On Wed, Apr 14, 2021 at 10:18 PM Julie Tibshirani 
> wrote:
> >
> > Hello everyone! I will pick up LUCENE-9908.
> >
> >
> > I had marked LUCENE-9583 as a blocker but I'm on board with removing its
> blocker status given we can make changes later. I hope to come back to the
> issue soon with some ideas.
> >
> >
> > Julie
> >
> >
> > On Wed, Apr 14, 2021 at 12:42 PM Adrien Grand  wrote:
> >>
> >> I agree that we can remove the blocker status from LUCENE-9583 and take
> advantage of the fact that these new APIs are experimental to improve
> things later.
> >>
> >> For the renaming issue, maybe we could just make vectors plural for now
> for consistency and revisit other options later.
> >>
> >> On Wed, Apr 14, 2021 at 8:21 PM Michael Sokolov 
> wrote:
> >>>
> >>> Thanks Adrien; I plan to tackle LUCENE-9905.
> >>>
> >>>  I don't have ideas about how to move forward on LUCENE-9583; I spent
> >>> significant amount of time trying various permutations on that API,
> >>> and what we have was the best compromise I could find at the time, so
> >>> I'm not sure I agree it's a Blocker, yet I'm open to improvements.
> >>> Maybe Julie will propose something?
> >>>
> >>> There is also a vector-related renaming issue Tomoko had opened, which
> >>> I thought was marked Blocker, but I guess no longer is. Previously I
> >>> had hoped to get some strong consensus, but that proved challenging.
> >>> Given that, I'm OK leaving things as-is, marking these apis
> >>> @experimental and potentially revisiting naming issues later, eg once
> >>> we have a second vector ANN implementation.
> >>>
> >>> On Wed, Apr 14, 2021 at 11:07 AM Adrien Grand 
> wrote:
> >>> >
> >>> > Hi Mike,
> >>> >
> >>> > Here's what I know about the remaining blockers:
> >>> >
> >>> > LUCENE-9908 - Move VectorValues#search to VectorReader and LeafReader
> >>> > This was discussed on the mailing list and it looks like there was
> agreement on making that change. If someone has cycles and can take it,
> please go ahead, otherwise I'll try to allocate some time to it. I'm
> expecting this change to be rather straightforward.
> >>> >
> >>> > LUCENE-9905 - Revise approach to specifying NN algorithm
> >>> > This is a change to how we've been thinking about configuring the
> ANN algorithm. I don't know if someone plans to work on it.
> >>> >
> >>> > LUCENE-9583 - How should we expose VectorValues.RandomAccess
> >>> > We'd like to get rid of this sub interface, but I'm not the best
> person to comment on how much work this needs. Maybe Mike S or Julie can
> give more info.
> >>> >
> >>> > LUCENE-9334 - Require consistency between data-structures on a
> per-field basis
> >>> > Mayya has been working on this one and it's very close.
> >>> >
> >>> > LUCENE-9047 - Directory APIs should be little endian
> >>> > Ignacio and Julie have been working on this one and it is close as
> well.
> >>> >
> >>> >
> >>> > On Tue, Apr 13, 2021 at 10:59 PM Mike Drob  wrote:
> >>> >>
> >>> >> Michael, did you get a chance to mark the issues you were thinking
> of as blockers?
> >>> >>
> >>> >> Adrien, I see that the remaining open blockers look mostly like
> your open issues. Two of them have recent activity, but LUCENE-9047 would
> need to be brought back to the lucene repo. Is this an accurate view of the
> state of things?
> >>> >>
> >>> >> Now that I'm done with 8.8.2, I would love to see how we can
> continue to make headway on 9.0!
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Mon, Mar 29, 2021 at 3:25 PM Michael Sokolov 
> wrote:
> >>> >>>
> >>> >>> There has been some discussion around a few code visibility and
> naming
> >>> >>> issues related to "VectorFormat" as it is called today. I'd like to
> >>> >>> get that sorted out before 9.0 - I'll hunt up the ticket(s) and
> mark
> >>> >>> as blockers
> >>> >>>
> >>> >>> On Sun, Mar 28, 2021 at 11:02 AM Adrien Grand 
> wrote:
> >>> >>> >
> >>> >>> > Hello Jan,
> >>> >>> >
> >>> >>> > The list of blockers should be mostly up-to-date:
> https://issues.apache.org/jira/browse/LUCENE-9661?jql=project%3D%22Lucene%20-%20Core%22%20and%20priority%3DBlocker%20and%20fixVersion%3D%22main%20(9.0)%22
> .
> >>> >>> >
> >>> >>> > On Sat, Mar 27, 2021 at 7:21 PM Jan Høydahl <
> jan@cominvent.com> wrote:
> >>> >>> >>
> >>> >>> >> Hi,
> >>> >>> >>
> >>> >>> >> Where are we at with the Lucene 9.0 release planning?
> >>> >>> >>
> >>> >>> >> The git split is largely done. Not sure about the build.
> >>> >>> >> Let's update the umbrella issue
> https://issues.apache.org/jira/browse/LUCENE-9375 for known remaining
> cleanup tasks.
> >>> >>> >> The one on that list is releaseWizard, but as Adrien says there
> are also other scripts that need updating.
> >>> >>> >>
> >>> >>> >> Jan
> >>> 

Re: 9.0 release

2021-06-29 Thread Mike Drob
Looks like just LUCENE-9334 remains?

On Wed, Apr 14, 2021 at 10:18 PM Julie Tibshirani  wrote:
>
> Hello everyone! I will pick up LUCENE-9908.
>
>
> I had marked LUCENE-9583 as a blocker but I'm on board with removing its 
> blocker status given we can make changes later. I hope to come back to the 
> issue soon with some ideas.
>
>
> Julie
>
>
> On Wed, Apr 14, 2021 at 12:42 PM Adrien Grand  wrote:
>>
>> I agree that we can remove the blocker status from LUCENE-9583 and take 
>> advantage of the fact that these new APIs are experimental to improve things 
>> later.
>>
>> For the renaming issue, maybe we could just make vectors plural for now for 
>> consistency and revisit other options later.
>>
>> On Wed, Apr 14, 2021 at 8:21 PM Michael Sokolov  wrote:
>>>
>>> Thanks Adrien; I plan to tackle LUCENE-9905.
>>>
>>>  I don't have ideas about how to move forward on LUCENE-9583; I spent
>>> significant amount of time trying various permutations on that API,
>>> and what we have was the best compromise I could find at the time, so
>>> I'm not sure I agree it's a Blocker, yet I'm open to improvements.
>>> Maybe Julie will propose something?
>>>
>>> There is also a vector-related renaming issue Tomoko had opened, which
>>> I thought was marked Blocker, but I guess no longer is. Previously I
>>> had hoped to get some strong consensus, but that proved challenging.
>>> Given that, I'm OK leaving things as-is, marking these apis
>>> @experimental and potentially revisiting naming issues later, eg once
>>> we have a second vector ANN implementation.
>>>
>>> On Wed, Apr 14, 2021 at 11:07 AM Adrien Grand  wrote:
>>> >
>>> > Hi Mike,
>>> >
>>> > Here's what I know about the remaining blockers:
>>> >
>>> > LUCENE-9908 - Move VectorValues#search to VectorReader and LeafReader
>>> > This was discussed on the mailing list and it looks like there was 
>>> > agreement on making that change. If someone has cycles and can take it, 
>>> > please go ahead, otherwise I'll try to allocate some time to it. I'm 
>>> > expecting this change to be rather straightforward.
>>> >
>>> > LUCENE-9905 - Revise approach to specifying NN algorithm
>>> > This is a change to how we've been thinking about configuring the ANN 
>>> > algorithm. I don't know if someone plans to work on it.
>>> >
>>> > LUCENE-9583 - How should we expose VectorValues.RandomAccess
>>> > We'd like to get rid of this sub interface, but I'm not the best person 
>>> > to comment on how much work this needs. Maybe Mike S or Julie can give 
>>> > more info.
>>> >
>>> > LUCENE-9334 - Require consistency between data-structures on a per-field 
>>> > basis
>>> > Mayya has been working on this one and it's very close.
>>> >
>>> > LUCENE-9047 - Directory APIs should be little endian
>>> > Ignacio and Julie have been working on this one and it is close as well.
>>> >
>>> >
>>> > On Tue, Apr 13, 2021 at 10:59 PM Mike Drob  wrote:
>>> >>
>>> >> Michael, did you get a chance to mark the issues you were thinking of as 
>>> >> blockers?
>>> >>
>>> >> Adrien, I see that the remaining open blockers look mostly like your 
>>> >> open issues. Two of them have recent activity, but LUCENE-9047 would 
>>> >> need to be brought back to the lucene repo. Is this an accurate view of 
>>> >> the state of things?
>>> >>
>>> >> Now that I'm done with 8.8.2, I would love to see how we can continue to 
>>> >> make headway on 9.0!
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Mar 29, 2021 at 3:25 PM Michael Sokolov  
>>> >> wrote:
>>> >>>
>>> >>> There has been some discussion around a few code visibility and naming
>>> >>> issues related to "VectorFormat" as it is called today. I'd like to
>>> >>> get that sorted out before 9.0 - I'll hunt up the ticket(s) and mark
>>> >>> as blockers
>>> >>>
>>> >>> On Sun, Mar 28, 2021 at 11:02 AM Adrien Grand  wrote:
>>> >>> >
>>> >>> > Hello Jan,
>>> >>> >
>>> >>> > The list of blockers should be mostly up-to-date: 
>>> >>> > https://issues.apache.org/jira/browse/LUCENE-9661?jql=project%3D%22Lucene%20-%20Core%22%20and%20priority%3DBlocker%20and%20fixVersion%3D%22main%20(9.0)%22.
>>> >>> >
>>> >>> > On Sat, Mar 27, 2021 at 7:21 PM Jan Høydahl  
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> Hi,
>>> >>> >>
>>> >>> >> Where are we at with the Lucene 9.0 release planning?
>>> >>> >>
>>> >>> >> The git split is largely done. Not sure about the build.
>>> >>> >> Let's update the umbrella issue 
>>> >>> >> https://issues.apache.org/jira/browse/LUCENE-9375 for known 
>>> >>> >> remaining cleanup tasks.
>>> >>> >> The one on that list is releaseWizard, but as Adrien says there are 
>>> >>> >> also other scripts that need updating.
>>> >>> >>
>>> >>> >> Jan
>>> >>> >>
>>> >>> >> 13. jan. 2021 kl. 15:10 skrev Adrien Grand :
>>> >>> >>
>>> >>> >> +1 to start planning 9.0.
>>> >>> >>
>>> >>> >> Since you mentioned the Gradle build, I believe that we still need 
>>> >>> >> to migrate some of the release tooling from Ant to Gradle, e.g. 
>>> >>> >> dev-tools/scripts/addBackcompatIndexes.py. 

Could not run benchmark for custom codec in Lucene 9

2021-06-29 Thread praveen nishchal
Hi Members,

We are able to build and run benchmark using various algo file but when we
want to test custom codec like SimpleTextCodec, we are encountering issue
as mentioned below:

Command ran

gradle run -PtaskAlg=conf/indexing-flush-by-RAM-multithreaded.alg
-PmaxHeapSize=5G

Error

> Task :lucene:benchmark:compileJava FAILED
/home/intel/Documents/lucene/lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/CreateIndexTask.java:30:
error: package org.apache.lucene.codecs.simpletext does not exist
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
  ^
/home/intel/Documents/lucene/lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/CreateIndexTask.java:200:
error: cannot find symbol
iwConf.setCodec(new SimpleTextCodec());
^
  symbol:   class SimpleTextCodec
  location: class CreateIndexTask
2 errors

Kindly help.

Thanks,
Praveen Nishchal


Re: Two-phase range queries?

2021-06-29 Thread Daniel Penning
Hi

There might be a way to improve this using a little bit of extra meta data
for docvalues. For some of the compression strategies used for docvalues
there is already some information that could be used to skip blocks of
documents if no value would match a queries range. By storing the minimum
and maximum for blocks of documents in the docvalue metadata it should be
possible to implement an approximation query that should be able to improve
most cases where a docvalue query is currently used. Postgres has the brin
(block range index) which is similar to this idea and gives in most cases
good performance with extremely small indices.

Am Di., 29. Juni 2021 um 16:39 Uhr schrieb Adrien Grand :

> Hi Greg,
>
> Have you looked at IndexOrDocValuesQuery? It dynamically chooses between
> computing the range up-front using the BKD tree and running the range query
> using doc values depending on the estimated cost of the range query
> (computed by counting the number of leaf nodes of the BKD tree that have
> matches, which is cheap to compute) vs. the cost of the overall query.
> Hopefully the javadocs are not too bad, and I wrote a small blog
> 
> about this query a few years ago.
>
>
> On Tue, Jun 29, 2021 at 3:30 PM Greg Miller  wrote:
>
>> Hi folks-
>>
>> I've been spending a little time getting familiar with the BKD-tree-based
>> range query support currently implemented in Lucene, and wonder if there's
>> ever been a discussion around supporting two-phase iteration in this space.
>> If I'm understanding the current implementation properly (specifically
>> looking at PointRangeQuery), it appears that all matches are determined
>> up-front by 1) identifying segments of the tree that contain candidate
>> matches (i.e., containing part of the query range), and then 2) confirming
>> whether-or-not the contained points actually fall within the range. I'm
>> also a little low on coffee this morning so it's entirely possible I'm
>> misunderstanding the current implementation (please correct me if so).
>>
>> With this approach, it seems like we could potentially be doing quite a
>> bit of wasted effort in some situations. I have no thoughts on how to
>> actually implement this yet, but I wonder if we could support two-phase
>> iteration by 1) returning all docs with points contained in candidate
>> BKD-tree segments as an approximation, and then 2) only checking the points
>> against the query range when confirming matches in the second phase? I
>> think the idea would extend to LatLonPointDistanceQuery as well (and maybe
>> others?).
>>
>> I did a Jira search for a related issue but came up empty. Anyone know if
>> this idea has been discussed previously, or if there's some inherent flaw
>> with the approach that would make it a non-starter? I don't really have any
>> cycles to work on this at the moment, but can at least open a Jira issue to
>> track if it seems like a reasonable thing to explore.
>>
>> Cheers,
>> -Greg
>>
>
>
> --
> Adrien
>


-- 
--

*Daniel Penning* / Senior Product Developer
T.: +49 (0)30 5900113-83 / F.: +49 (0)30 5900113-99
E-Mail: d.penn...@stroeermediabrands.de
Web: www.stroeermediabrands.de

Ströer Media Brands GmbH, Torstraße 49, 10119 Berlin-Mitte
Geschäftsführer: Tim Nocken, Marc Schmitz
Handelsregister: Amtsgericht Charlottenburg HRB 195991 B


Re: Two-phase range queries?

2021-06-29 Thread Robert Muir
On Tue, Jun 29, 2021 at 1:17 PM Greg Miller  wrote:

> Robert- Just to clarify a little bit (slightly more caffeinated now): For the 
> cases where an indexed rectangle fits entirely within the query rectangle, 
> these would be "confirmed" matches up-front, relying on the efficiency of the 
> BKD-tree data structure (as you describe). The second phase confirmation I 
> had in mind were the cases where the query rectangle overlaps, or sits inside 
> the indexed rectangle, and the individual points/docs within the indexed 
> rectangle need to be checked against the boundaries of the query rectangle. 
> That step seemed like a good candidate for a second phase, effectively 
> avoiding that check if it doesn't end up being necessary. But, if that is 
> relatively rare compared to the case where an indexed shape fits entirely 
> within the query shape, then maybe there isn't value.

It is rare/contained by definition, because it is a space-partitioned
data structure. If there were a "lot" of these then BKD would form
rectangles out of them and we'd check those rectangles instead.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Two-phase range queries?

2021-06-29 Thread Greg Miller
Thanks Robert/Adrien for the quick thoughts!

Robert- Just to clarify a little bit (slightly more caffeinated now): For
the cases where an indexed rectangle fits entirely within the query
rectangle, these would be "confirmed" matches up-front, relying on the
efficiency of the BKD-tree data structure (as you describe). The second
phase confirmation I had in mind were the cases where the query
rectangle overlaps,
or sits inside the indexed rectangle, and the individual points/docs within
the indexed rectangle need to be checked against the boundaries of the
query rectangle. That step seemed like a good candidate for a second phase,
effectively avoiding that check if it doesn't end up being necessary. But,
if that is relatively rare compared to the case where an indexed shape fits
entirely within the query shape, then maybe there isn't value.

Adrien- Oh cool! Was not aware of this actually, but after reading the blog
and looking at the code a little bit, this is in the neighborhood of what I
was thinking. I suppose what I had in mind is a little bit of a hybrid of
the two approaches currently implemented. I wonder if, in the case where
the range query is used to "verify" matches (to borrow from your
terminology), there's a benefit to still using the BKD-tree to quickly
eliminate docs that don't match before using the doc values to confirm
matches for "approximate" matches. Said differently, what if there was a
range query implementation that collected "definite" matches as well as
"potential" matches using the BKD-tree up-front (so definite matches are
docs where the indexed shape is entirely contained in the query shape and
potential matches are cases where the indexed shape overlaps the query
shape), then in a second phase used the doc values to confirm/reject
"potential" matches? Just a thought.

Unwinding a bit, these responses are exactly what I was looking for. As a
starting point, I wanted to understand if this had been explored, and it
looks like it certainly has—so thanks for all the pointers!

Cheers,
-Greg

On Tue, Jun 29, 2021 at 7:40 AM Adrien Grand  wrote:

> Hi Greg,
>
> Have you looked at IndexOrDocValuesQuery? It dynamically chooses between
> computing the range up-front using the BKD tree and running the range query
> using doc values depending on the estimated cost of the range query
> (computed by counting the number of leaf nodes of the BKD tree that have
> matches, which is cheap to compute) vs. the cost of the overall query.
> Hopefully the javadocs are not too bad, and I wrote a small blog
> 
> about this query a few years ago.
>
>
> On Tue, Jun 29, 2021 at 3:30 PM Greg Miller  wrote:
>
>> Hi folks-
>>
>> I've been spending a little time getting familiar with the BKD-tree-based
>> range query support currently implemented in Lucene, and wonder if there's
>> ever been a discussion around supporting two-phase iteration in this space.
>> If I'm understanding the current implementation properly (specifically
>> looking at PointRangeQuery), it appears that all matches are determined
>> up-front by 1) identifying segments of the tree that contain candidate
>> matches (i.e., containing part of the query range), and then 2) confirming
>> whether-or-not the contained points actually fall within the range. I'm
>> also a little low on coffee this morning so it's entirely possible I'm
>> misunderstanding the current implementation (please correct me if so).
>>
>> With this approach, it seems like we could potentially be doing quite a
>> bit of wasted effort in some situations. I have no thoughts on how to
>> actually implement this yet, but I wonder if we could support two-phase
>> iteration by 1) returning all docs with points contained in candidate
>> BKD-tree segments as an approximation, and then 2) only checking the points
>> against the query range when confirming matches in the second phase? I
>> think the idea would extend to LatLonPointDistanceQuery as well (and maybe
>> others?).
>>
>> I did a Jira search for a related issue but came up empty. Anyone know if
>> this idea has been discussed previously, or if there's some inherent flaw
>> with the approach that would make it a non-starter? I don't really have any
>> cycles to work on this at the moment, but can at least open a Jira issue to
>> track if it seems like a reasonable thing to explore.
>>
>> Cheers,
>> -Greg
>>
>
>
> --
> Adrien
>


Re: Two-phase range queries?

2021-06-29 Thread Adrien Grand
Hi Greg,

Have you looked at IndexOrDocValuesQuery? It dynamically chooses between
computing the range up-front using the BKD tree and running the range query
using doc values depending on the estimated cost of the range query
(computed by counting the number of leaf nodes of the BKD tree that have
matches, which is cheap to compute) vs. the cost of the overall query.
Hopefully the javadocs are not too bad, and I wrote a small blog

about this query a few years ago.


On Tue, Jun 29, 2021 at 3:30 PM Greg Miller  wrote:

> Hi folks-
>
> I've been spending a little time getting familiar with the BKD-tree-based
> range query support currently implemented in Lucene, and wonder if there's
> ever been a discussion around supporting two-phase iteration in this space.
> If I'm understanding the current implementation properly (specifically
> looking at PointRangeQuery), it appears that all matches are determined
> up-front by 1) identifying segments of the tree that contain candidate
> matches (i.e., containing part of the query range), and then 2) confirming
> whether-or-not the contained points actually fall within the range. I'm
> also a little low on coffee this morning so it's entirely possible I'm
> misunderstanding the current implementation (please correct me if so).
>
> With this approach, it seems like we could potentially be doing quite a
> bit of wasted effort in some situations. I have no thoughts on how to
> actually implement this yet, but I wonder if we could support two-phase
> iteration by 1) returning all docs with points contained in candidate
> BKD-tree segments as an approximation, and then 2) only checking the points
> against the query range when confirming matches in the second phase? I
> think the idea would extend to LatLonPointDistanceQuery as well (and maybe
> others?).
>
> I did a Jira search for a related issue but came up empty. Anyone know if
> this idea has been discussed previously, or if there's some inherent flaw
> with the approach that would make it a non-starter? I don't really have any
> cycles to work on this at the moment, but can at least open a Jira issue to
> track if it seems like a reasonable thing to explore.
>
> Cheers,
> -Greg
>


-- 
Adrien


Re: Two-phase range queries?

2021-06-29 Thread Robert Muir
Greg, I don't think this works the way that you are imagining. It
doesn't go document-at-a-time actually, and I don't think it is a good
fit for two-phases.

It also does not do this two-part confirmation. It reads rectangles
from the index and determines if a rectangle fits in the range.

For example, if i query on Virginia, USA it might iterate like this:

phase 1
* descend deeper into the USA rectangle, as it intersects the query range
* discard the Germany rectangle, as it is completely outside of query range.

phase 2
* discard entire western USA rectangle, outside of query range
* descend deeper into eastern USA rectangle

...

phase N
* accept some county-sized rectangles that fit entirely inside
virginia, they are inside query range
* descend deeper into some county-sized rectangles on the border with
north carolina, west virginia, maryland, etc

It might even have to test some individual points at the end of the
day to sort out the stuff on the "border" with the other states. But
the number of individual points tested should not be large, as the BKD
organizes itself based on density.

Try this video that Mike made, as I am short on coffee too:
https://www.youtube.com/watch?v=8BCf9Kx-AxY

On Tue, Jun 29, 2021 at 9:30 AM Greg Miller  wrote:
>
> Hi folks-
>
> I've been spending a little time getting familiar with the BKD-tree-based 
> range query support currently implemented in Lucene, and wonder if there's 
> ever been a discussion around supporting two-phase iteration in this space. 
> If I'm understanding the current implementation properly (specifically 
> looking at PointRangeQuery), it appears that all matches are determined 
> up-front by 1) identifying segments of the tree that contain candidate 
> matches (i.e., containing part of the query range), and then 2) confirming 
> whether-or-not the contained points actually fall within the range. I'm also 
> a little low on coffee this morning so it's entirely possible I'm 
> misunderstanding the current implementation (please correct me if so).
>
> With this approach, it seems like we could potentially be doing quite a bit 
> of wasted effort in some situations. I have no thoughts on how to actually 
> implement this yet, but I wonder if we could support two-phase iteration by 
> 1) returning all docs with points contained in candidate BKD-tree segments as 
> an approximation, and then 2) only checking the points against the query 
> range when confirming matches in the second phase? I think the idea would 
> extend to LatLonPointDistanceQuery as well (and maybe others?).
>
> I did a Jira search for a related issue but came up empty. Anyone know if 
> this idea has been discussed previously, or if there's some inherent flaw 
> with the approach that would make it a non-starter? I don't really have any 
> cycles to work on this at the moment, but can at least open a Jira issue to 
> track if it seems like a reasonable thing to explore.
>
> Cheers,
> -Greg

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Two-phase range queries?

2021-06-29 Thread Greg Miller
Hi folks-

I've been spending a little time getting familiar with the BKD-tree-based
range query support currently implemented in Lucene, and wonder if there's
ever been a discussion around supporting two-phase iteration in this space.
If I'm understanding the current implementation properly (specifically
looking at PointRangeQuery), it appears that all matches are determined
up-front by 1) identifying segments of the tree that contain candidate
matches (i.e., containing part of the query range), and then 2) confirming
whether-or-not the contained points actually fall within the range. I'm
also a little low on coffee this morning so it's entirely possible I'm
misunderstanding the current implementation (please correct me if so).

With this approach, it seems like we could potentially be doing quite a bit
of wasted effort in some situations. I have no thoughts on how to actually
implement this yet, but I wonder if we could support two-phase iteration by
1) returning all docs with points contained in candidate BKD-tree segments
as an approximation, and then 2) only checking the points against the query
range when confirming matches in the second phase? I think the idea would
extend to LatLonPointDistanceQuery as well (and maybe others?).

I did a Jira search for a related issue but came up empty. Anyone know if
this idea has been discussed previously, or if there's some inherent flaw
with the approach that would make it a non-starter? I don't really have any
cycles to work on this at the moment, but can at least open a Jira issue to
track if it seems like a reasonable thing to explore.

Cheers,
-Greg


Re: Welcome Mayya Sharipova to the Lucene PMC

2021-06-29 Thread Jan Høydahl
Welcome Mayya!

Jan

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Mayya Sharipova to the Lucene PMC

2021-06-29 Thread Alan Woodward
Welcome Mayya!

> On 29 Jun 2021, at 02:22, Koji Sekiguchi  wrote:
> 
> Congrats and welcome Mayya!
> 
> Koji
> 
> On 2021/06/28 22:16, Robert Muir wrote:
>> I am pleased to announce that Mayya has accepted an invitation to join
>> the Lucene PMC!
>> Congratulations, and welcome aboard!
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org