Re: Question about QueryCache

2021-02-27 Thread Haoyu Zhai
Thanks Mike and Adrien for confirming the behavior!
I checked again and debugged the unit test, and found that
IndexSearcher.createWeight is recursively called when BooleanQuery
creates its weight (
https://github.com/apache/lucene-solr/blob/e88b3e9c204f907fdb41d6d0f40d685574acde97/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L59).
I missed this part when I previously checked the logic.

Best
Patrick

Adrien Grand  wrote on Fri, Feb 26, 2021 at 1:02 PM:

> It does recurse indeed! To reuse Mike's example, in that case the cache
> would consider caching:
>  - A,
>  - B,
>  - C,
>  - D,
>  - (C D),
>  - +A +B +(C D)
>
> One weakness of this cache is that it doesn't consider caching subsets of
> boolean queries (except single clauses). E.g. in the above example, it
> would never consider caching +A +B even if the conjunction of these two
> clauses occurs in many queries.
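> Adrien's point can be illustrated with a simplified model (not the real
> Lucene API) of how recursive weight creation exposes every sub-query to
> the cache, while composite clause subsets like +A +B never appear:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of recursive weight creation: each time a weight is
// created for a clause, the query cache gets a chance to consider that
// clause. The Query/Term/Bool types here are illustrative stand-ins.
public class QueryCacheRecursionSketch {
    interface Query {}
    record Term(String text) implements Query {}
    record Bool(List<Query> clauses) implements Query {}

    // Collect every query the cache would "see" during weight creation.
    static void collectCandidates(Query q, List<String> seen) {
        seen.add(show(q));
        if (q instanceof Bool b) {
            // Recursion: like BooleanWeight, create a weight per clause.
            for (Query clause : b.clauses()) {
                collectCandidates(clause, seen);
            }
        }
    }

    static String show(Query q) {
        if (q instanceof Term t) return t.text();
        Bool b = (Bool) q;
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < b.clauses().size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(show(b.clauses().get(i)));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Models +A +B +(C D); occurs are omitted for brevity.
        Query q = new Bool(List.of(new Term("A"), new Term("B"),
                new Bool(List.of(new Term("C"), new Term("D")))));
        List<String> seen = new ArrayList<>();
        collectCandidates(q, seen);
        // Candidates: the whole query, A, B, (C D), C, D -- but never (A B).
        System.out.println(seen);
    }
}
```

> Note how the subset (A B) is absent from the output, matching the cache
> weakness described above.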
>
> On Fri, Feb 26, 2021 at 8:03 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hi Haoyu,
>>
>> I'm pretty sure (but not certain!) that query cache is smart enough to
>> recurse through the full query tree, and consider any of the whole queries
>> it finds during that recursion.
>>
>> So e.g. a query like +A +B +(C D) would consider caching A, B, C D, or
>> the whole original +A +B +(C D) query.
>>
>> But I'm not sure!  Hopefully someone who knows more about query cache
>> might chime in.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Feb 22, 2021 at 8:55 PM Haoyu Zhai  wrote:
>>
>>> Hi folks,
>>> I'm trying to understand how QueryCache works, and one question that
>>> popped into my head was: does QueryCache cache
>>> 1. the whole query submitted to IndexSearcher, or
>>> 2. does it recurse into the query and selectively cache some of the
>>> clauses (especially for BooleanQuery)?
>>>
>>> From my observation it is the former case but I just want to double
>>> check in case I missed anything.
>>>
>>> Thanks
>>> Patrick
>>>
>>


Question about QueryCache

2021-02-22 Thread Haoyu Zhai
Hi folks,
I'm trying to understand how QueryCache works, and one question that
popped into my head was: does QueryCache cache
1. the whole query submitted to IndexSearcher, or
2. does it recurse into the query and selectively cache some of the
clauses (especially for BooleanQuery)?

From my observation it is the former case but I just want to double check
in case I missed anything.

Thanks
Patrick


Re: [VOTE] Release Lucene/Solr 8.8.0 RC2

2021-01-27 Thread Haoyu Zhai
+1 (non-binding)

Tested the Lucene part of RC1 on our service; since that part is unchanged,
still +1.

Patrick

Namgyu Kim  wrote on Wed, Jan 27, 2021 at 10:26 AM:

> +1 (binding)
>
> SUCCESS! [1:30:27.376324]
>
> On Tue, Jan 26, 2021 at 10:19 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> +1 (binding)
>>
>>
>> SUCCESS! [0:43:40.201461]
>>
>>
>> However, the first time I ran smoke tester, it failed with this:
>>
>>[junit4] Tests with failures [seed: D3F97A1F3602195A]:
>>
>>[junit4]   -
>> org.apache.solr.cloud.LeaderTragicEventTest.testLeaderFailsOver
>>
>>
>>[junit4]   2> NOTE: reproduce with: ant test  
>> -Dtestcase=LeaderTragicEventTest
>> -Dtests.method=testLeaderFailsOver -Dtests.seed=D3F97A1F3602195A
>> -Dtests.locale=ar-LB -Dtests.timezone=SystemV/MST7MDT -Dtests.asserts=true
>> -Dtes\
>>
>> ts.file.encoding=US-ASCII
>>
>>[junit4] ERROR   10.9s J1 | LeaderTragicEventTest.testLeaderFailsOver
>> <<<
>>
>>[junit4]> Throwable #1:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
>> from server at https://127.0.0.1:33003/solr: Underlying core creation
>> failed while creating collection: testLeaderFailsO\
>>
>> ver
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:369)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:297)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1171)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:934)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:866)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:214)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:231)
>>
>>[junit4]>at
>> org.apache.solr.cloud.LeaderTragicEventTest.testLeaderFailsOver(LeaderTragicEventTest.java:80)
>>
>>[junit4]>at
>> java.lang.Thread.run(Thread.java:748)Throwable #2:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
>> from server at https://127.0.0.1:33003/solr: Could not find collection :
>> \
>>
>> testLeaderFailsOver
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:369)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:297)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1171)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:934)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:866)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:214)
>>
>>[junit4]>at
>> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:231)
>>
>>[junit4]>at
>> org.apache.solr.cloud.LeaderTragicEventTest.tearDown(LeaderTragicEventTest.java:73)
>>
>>[junit4]>at java.lang.Thread.run(Thread.java:748)
>>
>>
>> I guess it was a transient failure -- I re-ran smoke tester and it passed
>> the 2nd time.  Is this a known Bad Apple test?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Tue, Jan 26, 2021 at 4:53 AM Ignacio Vera  wrote:
>>
>>> +1 (binding)
>>>
>>> SUCCESS! [0:53:01.546134]
>>>
>>> On Tue, Jan 26, 2021 at 1:51 AM Tomás Fernández Löbbe <
>>> tomasflo...@gmail.com> wrote:
>>>
 Thanks Noble! And thanks for fixing that concurrency issue, I'd hit it
 but didn't have time to investigate it.

 +1
 SUCCESS! [0:58:32.036482]

 On Mon, Jan 25, 2021 at 10:19 AM Timothy Potter 
 wrote:

> Thanks Noble!
>
> +1 SUCCESS! [1:24:28.212370] (my internet is super slow today)
>
> Re-ran all the Solr 

Re: [VOTE] Release Lucene/Solr 8.8.0 RC1

2021-01-21 Thread Haoyu Zhai
+1 (non-binding)

I've merged the branch locally and run our team's (Amazon Search) benchmark
tools. Everything looks good.

David Smiley  wrote on Thu, Jan 21, 2021 at 9:30 AM:

> +1
> SUCCESS! [1:17:50.702261]
>
> Thanks for your thorough testing Tim :-)
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, Jan 21, 2021 at 10:49 AM Andrzej Białecki  wrote:
>
>> +1 (binding)
>>
>> SUCCESS! [1:31:27.365392]
>>
>>
>> > On 20 Jan 2021, at 08:39, Atri Sharma  wrote:
>> >
>> > +1 (binding)
>> >
>> > SUCCESS! [1:04:15:20393]
>> >
>> > On Wed, Jan 20, 2021 at 1:03 PM Ignacio Vera  wrote:
>> >>
>> >> +1 (binding)
>> >>
>> >> SUCCESS! [1:05:30.358141]
>> >>
>> >>
>> >> On Tue, Jan 19, 2021 at 8:25 PM Timothy Potter 
>> wrote:
>> >>>
>> >>> +1 (binding)
>> >>>
>> >>> SUCCESS! [1:07:15.796578]
>> >>>
>> >>>
>> >>> Also built a *local* Docker image from the RC and tested various
>> features with the Solr operator on K8s, such as the updates to the Prom
>> exporter & Grafana dashboard for query performance.
>> >>>
>> >>>
>> >>> Looks good!
>> >>>
>> >>>
>> >>> On Tue, Jan 19, 2021 at 12:06 PM Houston Putman <
>> houstonput...@gmail.com> wrote:
>> 
>>  +1
>> 
>>  SUCCESS! [1:01:28.552891]
>> 
>>  On Tue, Jan 19, 2021 at 1:53 PM Cassandra Targett <
>> casstarg...@gmail.com> wrote:
>> >
>> > I’ve put up the DRAFT version of the Ref Guide for 8.8:
>> https://lucene.apache.org/solr/guide/8_8/.
>> >
>> > I also created the Jenkins job for building the 8.8 guide which
>> pushes to the Nightlies server in case we have edits between now and
>> release (https://nightlies.apache.org/Lucene/Solr-reference-guide-8.8/).
>> >
>> > Note branch_8_8 does not (yet) include the new Math Expressions
>> guide being worked on in SOLR-13105. Still hoping that will make it, but
>> thought I’d get this out sooner rather than later just in case.
>> > On Jan 19, 2021, 10:51 AM -0600, Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com>, wrote:
>> >
>> > Please vote for release candidate 1 for Lucene/Solr 8.8.0
>> >
>> > The artifacts can be downloaded from:
>> >
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC1-rev737cb9c49b08f6e3964c1e8a80132da3c764e027
>> >
>> > You can run the smoke tester directly with this command:
>> >
>> > python3 -u dev-tools/scripts/smokeTestRelease.py \
>> >
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC1-rev737cb9c49b08f6e3964c1e8a80132da3c764e027
>> >
>> > The vote will be open for at least 72 hours i.e. until 2021-01-22
>> 17:00 UTC.
>> >
>> > [ ] +1  approve
>> > [ ] +0  no opinion
>> > [ ] -1  disapprove (and reason why)
>> >
>> > Here is my +1
>> > 
>> >
>> > --
>> > Regards,
>> >
>> > Atri
>> > Apache Concerted
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


Re: LeafReaderContext ord is unexpectedly 0

2020-12-27 Thread Haoyu Zhai
Hi Joel,
LeafReader.getContext() is expected to return "the root IndexReaderContext
for this IndexReader's sub-reader tree" (
https://lucene.apache.org/core/5_2_0/core/org/apache/lucene/index/LeafReader.html#getContext()
), which means it returns a context with ord 0 (a newly constructed one,
not the previous one [1]) if the reader is already a leaf. So I think this
is expected?

[1]:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L43
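The distinction can be illustrated with a simplified model of the context
tree (these are toy classes mirroring the behavior, not Lucene's real ones):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of why leaves.get(5).reader().getContext().ord is 0: the
// leaf's position in the parent's leaves() list is its ordinal there,
// but LeafReader.getContext() builds a FRESH root context for that one
// reader, and a root context always gets ord 0.
public class OrdSketch {
    static class LeafReader {
        // Mirrors LeafReader.getContext(): a newly constructed root context.
        LeafContext getContext() { return new LeafContext(this, 0); }
    }

    record LeafContext(LeafReader reader, int ord) {}

    public static void main(String[] args) {
        List<LeafContext> leaves = new ArrayList<>();
        for (int i = 0; i < 6; i++) {            // parent assigns ords 0..5
            leaves.add(new LeafContext(new LeafReader(), i));
        }
        System.out.println(leaves.get(5).ord());                       // 5: ord within the parent tree
        System.out.println(leaves.get(5).reader().getContext().ord()); // 0: fresh root context
    }
}
```

To keep the original ordinal, hold on to the LeafReaderContext from
leaves() rather than asking the reader for a new context.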

Best
Patrick

Joel Bernstein  wrote on Sun, Dec 27, 2020 at 8:59 AM:

> I ran into this while writing some Solr code today.
>
> List leaves =
> req.getSearcher().getTopReaderContext().leaves();
>
> The req is a SolrQueryRequest object.
>
> Now if I do this:
>
> leaves.get(5).reader().getContext().ord
>
> I would expect *ord* in this scenario to be *5*.
>
> But in my testing in master it's returning 0.
>
> It seems like this is a bug. Not sure yet if this is a bug in Sor or
> Lucene. Am I missing anything here that anyone can see?
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>


Re: Deterministic index construction

2020-12-19 Thread Haoyu Zhai
Hi Adrien,
I think Mike's comment is correct: we already have the index sorted, but we
want to reconstruct an index with the exact same number of segments, where
each segment contains the exact same documents.

Mike,
addIndexes can take a CodecReader as input [1], which allows us to pass in
a customized FilteredIndexReader, I think? Then it knows which docs to take.
Then, supposing the original index has N segments, we could open N IndexWriters
concurrently, rebuild those N segments, and at last somehow merge them
back into a whole index. (I am not quite sure whether we can achieve the
last step easily, but it doesn't sound too hard?)

[1]
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...-
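The grouping step this relies on can be sketched with a hypothetical
doc-to-segment map (the map, names, and types are illustrative, not Lucene
API; in practice each group would drive a filtered CodecReader passed to
IndexWriter.addIndexes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of idea 1 from the thread: given a dumped doc->segment map from
// the pre-built index, group the doc IDs so each target segment gets
// exactly the docs it held before. Each group could then back one
// filtered reader / one concurrent rebuild of that segment.
public class SegmentPartitionSketch {
    static Map<Integer, List<Integer>> partition(int[] docToSegment) {
        Map<Integer, List<Integer>> segments = new TreeMap<>();
        for (int docId = 0; docId < docToSegment.length; docId++) {
            segments.computeIfAbsent(docToSegment[docId], s -> new ArrayList<>())
                    .add(docId);
        }
        return segments;
    }

    public static void main(String[] args) {
        // Docs 0, 2, 4 belonged to segment 0; docs 1, 3 to segment 1.
        int[] docToSegment = {0, 1, 0, 1, 0};
        System.out.println(partition(docToSegment)); // {0=[0, 2, 4], 1=[1, 3]}
    }
}
```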

Michael Sokolov  wrote on Sat, Dec 19, 2020 at 9:13 AM:

> I don't know about addIndexes. Does that let you say which document goes
> where somehow? Wouldn't you have to select a subset of documents from each
> originally indexed segment?
>
> On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov  wrote:
>
>> I think the idea is to exert control over the distribution of documents
>> among the segments, in a deterministic reproducible way.
>>
>> On Sat, Dec 19, 2020, 11:39 AM Adrien Grand  wrote:
>>
>>> Have you considered leveraging Lucene's built-in index sorting? It
>>> supports concurrent indexing and is quite fast.
>>>
>>> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai  wrote:
>>>
>>>> Hi
>>>> Our team is seeking a way to construct (or rebuild) a deterministic,
>>>> sorted index concurrently. (I know Lucene can achieve that in a sequential
>>>> manner, but that might be too slow for us sometimes.)
>>>> Currently we have roughly two ideas, both assuming there's a pre-built
>>>> index and we have dumped a doc-to-segment map so that IndexWriter is able
>>>> to know which doc belongs to which segment:
>>>> 1. First build the index in the normal way (concurrently); after the index
>>>> is built, use the "addIndexes" functionality to merge documents into the
>>>> correct segment.
>>>> 2. By controlling FlushPolicy and other related classes, make sure each
>>>> segment created (before merge) has only the documents that belong to one of
>>>> the segments in the pre-built index, and create a dedicated MergePolicy to
>>>> only merge segments belonging to one pre-built segment.
>>>>
>>>> Basically we think the first one is easier to implement and the second one
>>>> is faster. We want to seek some ideas, suggestions, and feedback here.
>>>>
>>>> Thanks
>>>> Patrick Zhai
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>


Deterministic index construction

2020-12-18 Thread Haoyu Zhai
Hi
Our team is seeking a way to construct (or rebuild) a deterministic, sorted
index concurrently. (I know Lucene can achieve that in a sequential manner,
but that might be too slow for us sometimes.)
Currently we have roughly two ideas, both assuming there's a pre-built index
and we have dumped a doc-to-segment map so that IndexWriter is able to know
which doc belongs to which segment:
1. First build the index in the normal way (concurrently); after the index is
built, use the "addIndexes" functionality to merge documents into the correct
segment.
2. By controlling FlushPolicy and other related classes, make sure each
segment created (before merge) has only the documents that belong to one of
the segments in the pre-built index, and create a dedicated MergePolicy to
only merge segments belonging to one pre-built segment.

Basically we think the first one is easier to implement and the second one is
faster. We want to seek some ideas, suggestions, and feedback here.

Thanks
Patrick Zhai


Re: Question about behaviour of IntervalIterator

2020-11-19 Thread Haoyu Zhai
Thanks Alan,
I've opened an issue:
https://issues.apache.org/jira/browse/LUCENE-9618
And also a PR including a unit test to demonstrate the issue:
https://github.com/apache/lucene-solr/pull/2090
It seems we're not on exactly the same point: originally I was asking
whether nextInterval() is supposed to be called after NO_MORE_DOCS is
returned. But we should definitely update the docs and tests for the behavior
after NO_MORE_INTERVALS is returned as well.

Patrick
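The defensive contract the thread converges on, that an exhausted iterator
must tolerate further nextInterval() calls, can be sketched outside the real
IntervalIterator API (all names here are illustrative):

```java
// Toy model of the contract discussed in the thread: once the iterator
// is exhausted, further nextInterval() calls keep returning
// NO_MORE_INTERVALS instead of hitting undefined behavior, because
// callers such as minimum-interval algorithms may still invoke it.
public class IntervalSketch {
    static final int NO_MORE_INTERVALS = Integer.MAX_VALUE;

    static class SafeIntervalIterator {
        private final int[] starts;
        private int upto = -1;

        SafeIntervalIterator(int... starts) { this.starts = starts; }

        int nextInterval() {
            if (upto + 1 >= starts.length) {
                return NO_MORE_INTERVALS; // stays safe after exhaustion
            }
            return starts[++upto];
        }
    }

    public static void main(String[] args) {
        SafeIntervalIterator it = new SafeIntervalIterator(3, 7);
        System.out.println(it.nextInterval()); // 3
        System.out.println(it.nextInterval()); // 7
        System.out.println(it.nextInterval() == NO_MORE_INTERVALS); // true
        System.out.println(it.nextInterval() == NO_MORE_INTERVALS); // still true, no exception
    }
}
```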

Alan Woodward  wrote on Thu, Nov 19, 2020 at 1:36 AM:

> Some of the minimum-interval algorithms will call nextInterval() or
> start() even after the interval has been exhausted, so we need to handle
> those situations properly.  Improved java doc would definitely be helpful
> though, and maybe we should update checkIntervals() in TestIntervals to
> test what happens when calling nextInterval() after it has returned
> NO_MORE_INTERVALS.  Do you want to open an issue?
>
> - Alan
>
> > On 19 Nov 2020, at 08:17, Haoyu Zhai  wrote:
> >
> > Hi,
> > I'm trying to play around with my own IntervalSource and found that
> > the "nextInterval" method of IntervalIterator is sometimes called even
> > after the "nextDoc"/"docID"/"advance" methods return NO_MORE_DOCS.
> > After I dug a bit more, I found that FilteringIntervalIterator.reset
> > calls an inner iterator's "nextInterval" regardless of the result of
> > "nextDoc", and also that most (if not all) existing IntervalIterator
> > implementations do consider the case where "nextInterval" is called after
> > "nextDoc" returns NO_MORE_DOCS.
> > I'm a bit confused here, since I thought Lucene assumes undefined
> > behavior in most places after NO_MORE_DOCS is returned (those methods
> > should be called only after "advance"), but for "nextInterval" it seems
> > that's not the case. Should we change the current behavior of the
> > "nextInterval" implementations or add a cautionary comment to the javadoc?
> >
> > Thanks
> > Patrick Zhai
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Question about behaviour of IntervalIterator

2020-11-19 Thread Haoyu Zhai
Hi,
I'm trying to play around with my own IntervalSource and found that
the "nextInterval" method of IntervalIterator is sometimes called even
after the "nextDoc"/"docID"/"advance" methods return NO_MORE_DOCS.
After I dug a bit more, I found that FilteringIntervalIterator.reset
calls an inner iterator's "nextInterval" regardless of the result of
"nextDoc", and also that most (if not all) existing IntervalIterator
implementations do consider the case where "nextInterval" is called after
"nextDoc" returns NO_MORE_DOCS.
I'm a bit confused here, since I thought Lucene assumes undefined
behavior in most places after NO_MORE_DOCS is returned (those methods should
be called only after "advance"), but for "nextInterval" it seems that's not
the case. Should we change the current behavior of the "nextInterval"
implementations or add a cautionary comment to the javadoc?

Thanks
Patrick Zhai


[jira] [Commented] (LUCENE-8878) Provide alternative sorting utility from SortField other than FieldComparator

2019-07-07 Thread Haoyu Zhai (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879908#comment-16879908
 ] 

Haoyu Zhai commented on LUCENE-8878:


[~rcmuir] If I understand LatLonPointDistanceComparator correctly, the `copy` 
method is not optimized, so once we make use of this comparator's inner storage 
(the `values` field), we'll always incur the full cost (as we'll always want to 
`copy` first to store the values)? The actual optimization happens before we 
call the `copy` operation: we can make a call to `compareBottom` to filter out 
bad points at a lower cost. So I guess it is not necessary to have the `values` 
field to keep the optimization, as `compareBottom` is not using `values` 
anyway? I guess that to keep the optimization for LatLonPointDistanceComparator 
we need to have `compareBottom` and `setBottom` and the related fields, but 
need not keep storage of the whole sort values in the comparator?

Also [~hypothesisx86], I think that rather than having the comparison logic in 
`SortField`, we could have a comparator class and bind that class to the 
`ValueAccessor` to enable easier customization?

> Provide alternative sorting utility from SortField other than FieldComparator
> -
>
> Key: LUCENE-8878
> URL: https://issues.apache.org/jira/browse/LUCENE-8878
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1.1
>Reporter: Tony Xu
>Priority: Major
>
> The `FieldComparator` has many responsibilities and users get all of them at 
> once. At a high level, the main functionalities of `FieldComparator` are 
>  * Provide LeafFieldComparator
>  * Allocate storage for requested number of hits
>  * Read the values from DocValues/Custom source etc.
>  * Compare two values
> There are three major areas for improvement
>  # The logic of reading values and storing them is coupled.
>  # Users need to specify the size in order to create a `FieldComparator`, but 
> sometimes the size is unknown upfront.
>  # From `FieldComparator`'s API, one can't reason about thread-safety so it 
> is not suitable for concurrent search.
>  E.g. can two concurrent threads use the same `FieldComparator` to call 
> `getLeafComparator` for two different segments they are working on? In fact, 
> almost all existing implementations of `FieldComparator` are not thread-safe.
> The proposal is to enhance `SortField` with two APIs
>  # {color:#14892c}int compare(Object v1, Object v2){color} – this is to 
> compare two values from different docs for this field
>  # {color:#14892c}ValueAccessor newValueAccessor(LeafReaderContext 
> leaf){color} – This encapsulate the logic for obtaining the right 
> implementation in order to read the field values.
>  `ValueAccessor` should be accessed in a similar way as `DocValues` to 
> provide the sort value for a document in an advance & read fashion.
> With this API, hopefully we can reduce the memory usage when using 
> `FieldComparator` because the users either store the sort values or at least 
> the slot number besides the storage allocated by `FieldComparator` itself. 
> Ideally, only one copy of the values should be stored.
> The proposed API is also friendlier to concurrent search since it provides 
> a `ValueAccessor` per leaf. Although the same `ValueAccessor` can't be shared 
> when more than one thread is working on the same leaf, each thread can at 
> least initialize its own `ValueAccessor`.
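The two proposed APIs can be sketched as interfaces with a toy implementation.
The names `compare` and `newValueAccessor` follow the proposal; everything
else (the generic signatures, the long-array backing) is illustrative, not
Lucene code:

```java
// Sketch of the proposed decoupling: SortField-level compare() plus a
// per-leaf ValueAccessor, so value reading/storage is separated from
// comparison and each thread can hold its own accessor per leaf.
public class SortApiSketch {
    // Per-leaf, advance-and-read access to sort values, like DocValues.
    interface ValueAccessor<T> {
        boolean advanceExact(int doc);
        T value();
    }

    interface LeafSortField<T> {
        int compare(T v1, T v2);                       // compare values from any two docs
        ValueAccessor<T> newValueAccessor(int leafOrd); // one accessor per leaf/thread
    }

    // Toy implementation over a fixed long[] per leaf.
    static LeafSortField<Long> longField(long[][] valuesPerLeaf) {
        return new LeafSortField<>() {
            public int compare(Long v1, Long v2) { return Long.compare(v1, v2); }
            public ValueAccessor<Long> newValueAccessor(int leafOrd) {
                long[] values = valuesPerLeaf[leafOrd];
                return new ValueAccessor<>() {
                    private int doc = -1;
                    public boolean advanceExact(int d) { doc = d; return d < values.length; }
                    public Long value() { return values[doc]; }
                };
            }
        };
    }

    public static void main(String[] args) {
        LeafSortField<Long> f = longField(new long[][] {{5, 2, 9}});
        ValueAccessor<Long> acc = f.newValueAccessor(0); // each thread makes its own
        acc.advanceExact(0); long a = acc.value();
        acc.advanceExact(2); long b = acc.value();
        System.out.println(f.compare(a, b)); // negative: doc 0's value sorts before doc 2's
    }
}
```

No hit-count sizing is needed up front, and comparison carries no hidden
per-slot storage, addressing the coupling and sizing concerns above.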



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org