Re: Query about the GitHub statistics for Lucene

2024-03-06 Thread Chris Hegarty
Hi,

Seems that I’ve fallen into the newbie PMC Chair rabbit hole! ;-) - the 
reporting tool has long-standing issues. Maybe they’re fixable, maybe not, but 
we may not actually need it now.

> On 5 Mar 2024, at 18:22, Michael McCandless  wrote:
> 
> ...
> @Mike. Would it be possible to add a “Past 3 months” filter to 
> https://githubsearch.mikemccandless.com/search.py ? That would be helpful 
> when reporting.
> 
> Good idea!  Done!  
> https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated&dd=status%3AOpen&dd=updated%3APast+3+months

Cool. Thanks.

The stats I’m trying to retrieve are for PRs created in the past 3 months. 
GitHub allows me to get that with:
   https://github.com/apache/lucene/pulls?q=is%3Apr+created%3A%3E2023-12-05

which (when run today) shows: 36 open and 163 closed PRs.

Another interesting stat is PRs UPDATED in the past 3 months, e.g.
  https://github.com/apache/lucene/pulls?q=is%3Apr+updated%3A%3E2023-12-05+
   ~355 PRs updated.
   (which we can also see from Mike’s githubsearch [1])
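
For scripting these numbers, the same counts can be pulled from GitHub’s
search API. A rough sketch, assuming curl and jq are available (total_count
in the response is the number of matches; unauthenticated requests are
rate-limited):

  $ # PRs created in the past 3 months (open + closed together)
  $ curl -s 'https://api.github.com/search/issues?q=repo:apache/lucene+is:pr+created:%3E2023-12-05' | jq .total_count
  $ # PRs updated in the past 3 months
  $ curl -s 'https://api.github.com/search/issues?q=repo:apache/lucene+is:pr+updated:%3E2023-12-05' | jq .total_count

Appending +is:open or +is:closed to the query splits the counts.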

@Mike is it possible to add a “created since” filter?

Another very rough approximation of activity/health is commit counts, e.g.

  $ git log --pretty='format:%cd' --since='3 months ago' | wc -l
  244
  $ git log --all --pretty='format:%cd' --since='3 months ago' | wc -l
  472

So 472 commits on all branches in the past 3 months.
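
(Side note: --pretty='format:%cd' omits the final newline, so the wc -l
counts above come out one short. git rev-list counts directly and should
give the same numbers without the pipe:

  $ git rev-list --count --since='3 months ago' HEAD
  $ git rev-list --count --all --since='3 months ago'

)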

-Chris

[1] 
https://githubsearch.mikemccandless.com/search.py?chg=du&text=&a1=status&a2=undefined&page=0&searcher=29577&sort=recentlyUpdated&format=list&id=uzz5ht9buk98&dd=status%3AOpen&dd=updated%3APast+3+months&dd=issue_or_pr%3APR&newText=





Re: Query about the GitHub statistics for Lucene

2024-03-06 Thread Michael McCandless
On Wed, Mar 6, 2024 at 4:41 AM Chris Hegarty wrote:

> Seems that I’ve fallen into the newbie PMC Chair rabbit hole! ;-) - the
> reporting tool has long-standing issues. Maybe they’re fixable, maybe not,
> but we may not actually need it now.
>

Sorry :)  Seems to be a rite of passage at this point!  It should be
mentioned in the handover instructions... or we should simply merge Daniel
Gruno's one-line fix to the regexp that the Kibble/Whimsy/reporter tool uses:
https://issues.apache.org/jira/browse/COMDEV-425?focusedCommentId=17823767&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17823767

> @Mike is it possible to add a “created since” filter?

Ahh good idea, done!
https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated&dd=created%3APast+3+months&dd=issue_or_pr%3APR
(this is PRs created in the Past 3 months ... it shows 36 open and 162
closed right now, close to the GitHub counts you found).

Here's the luceneserver commit that adds it:
https://github.com/mikemccand/luceneserver/commit/397942573bed3e2c4fd00ab0a324a19fd014bfd4

Mike McCandless

http://blog.mikemccandless.com


Re: Query about the GitHub statistics for Lucene

2024-03-06 Thread Chris Hegarty
Hi Mike,

> On 6 Mar 2024, at 10:47, Michael McCandless  wrote:
> 
> On Wed, Mar 6, 2024 at 4:41 AM Chris Hegarty  
> wrote:
> 
> Seems that I’ve fallen into the newbie PMC Chair rabbit hole! ;-) - the 
> reporting tool has long-standing issues. Maybe they’re fixable, maybe not, 
> but we may not actually need it now.
> 
> Sorry :)  Seems to be a rite of passage at this point! 

Ha! Just happy that I’m not alone on this.

> It should be mentioned in the handover instructions... or we should simply 
> merge Daniel Gruno's one-line fix to the regexp that the Kibble/Whimsy/reporter 
> tool uses: 
> https://issues.apache.org/jira/browse/COMDEV-425?focusedCommentId=17823767&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17823767

That would be great, but I’m not sure why it hasn’t been done before now. 
I’ll add a note to future handover instructions if it cannot be resolved.

> @Mike is it possible to add a “created since” filter?
> 
> Ahh good idea, done!  
> https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated&dd=created%3APast+3+months&dd=issue_or_pr%3APR
>   (this is PRs created in the Past 3 months ... it shows 36 open and 162 
> closed right now, close to the GitHub counts you found).

This looks right, thanks. I think we can use Githubsearch going forward. :-) 

> Here's the luceneserver commit that adds it: 
> https://github.com/mikemccand/luceneserver/commit/397942573bed3e2c4fd00ab0a324a19fd014bfd4

Thank you,
-Chris.



Re: Query about the GitHub statistics for Lucene

2024-03-06 Thread Uwe Schindler

Hi,

Yes, we should contact INFRA so they get all the repository links up to 
date. Maybe they should send us a list of the tracked repos/issue trackers 
for us to review. There were also some crazy things, like the temporary 
repository that we used to migrate our issues from JIRA to GitHub being 
used for statistics, but NOT the apache/lucene one.


The statistics for JIRA are clearly wrong, too. The last change in JIRA 
was Aug 19, 2022.


Uwe

On 05.03.2024 at 14:26, Robert Muir wrote:

> On Tue, Mar 5, 2024 at 4:50 AM Chris Hegarty wrote:
>
>> It appears that there is no GH activity for 2024! Clearly this is incorrect.
>> I’ve yet to track down what’s going on with this. Familiar to anyone here?
>
> Last time I looked at this, it appeared to be looking at the wrong
> GitHub repositories, for example https://github.com/apache/lucene-solr
> and not https://github.com/apache/lucene



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de





Re: [JENKINS] Lucene » Lucene-Check-main (s390x big endian) - Build # 460 - Still Failing!

2024-03-06 Thread Uwe Schindler

See this issue: https://github.com/apache/lucene/issues/13161

The s390x server (big endian) has no Java 21 yet. I'll keep the job 
enabled; it should work again soon.
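
Once Java 21 lands on that node, a quick sanity check (assuming shell 
access to the node; the path is the one from the build log below):

  $ ls -ld /home/jenkins/tools/java/latest21
  $ /home/jenkins/tools/java/latest21/bin/java -version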


Uwe

On 06.03.2024 at 23:09, Apache Jenkins Server wrote:

Build: 
https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main%20(s390x%20big%20endian)/460/

No tests ran.

Build Log:
[...truncated 29 lines...]
ERROR: JAVA_HOME is set to an invalid directory: 
/home/jenkins/tools/java/latest21

Please set the JAVA_HOME variable in your environment to match the
location of your Java installation.

Build step 'Invoke Gradle script' changed build result to FAILURE
Build step 'Invoke Gradle script' marked build as failure
Archiving artifacts
Recording test results
ERROR: Step ‘Publish JUnit test result report’ failed: No test report files 
were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de





Re: Inlining, virtual calls and BKDPointsTree

2024-03-06 Thread Gautam Worah
> I'll try tweaking the query set to target queries with more Point hits
during the week and see what comes out..

I tried this, and also benchmarked the change on 2 other types of indexes
with slightly varying attributes. They roughly correspond to indexes for
different categories of products.
Performance on both throughput and latency was flat.

The change still LGTM.

Regards,
Gautam Worah.


On Sat, Mar 2, 2024 at 8:45 AM Gautam Worah  wrote:

> > I am running Amazon Product Search's benchmarks to see if the change is
> needle moving for us.
>
> Results were flat to slightly positive (+0.94% redline QPS) on our
> workload.
> Although we do have numeric range queries that would've improved, I
> suspect it is flat because our workload is heavily dominated by TermQueries
> and their combinations with various clauses.
>
> I'll try tweaking the query set to target queries with more Point hits
> during the week and see what comes out..
>
> Regards,
> Gautam Worah.
>
>
> On Sat, Mar 2, 2024 at 2:33 AM Anton Hägerstrand 
> wrote:
>
>> Thank you Gautam!
>>
>> > Yeah, it seems like luceneutil is not stressing the code path that
>> ElasticSearch's benchmarks are?
>>
>> Yes, as far as I understand it - though it might just be that I don't
>> understand luceneutil well enough. I believe that in order to see the
>> performance diff, numerical range queries or numerical sorting would have to
>> be involved - the more documents matched, the larger the difference. This is
>> what the relevant benchmark operations from Elastic do.
>>
>> > So it seems like switching over from an iterative visit(int docID)
>> call to a bulk visit(DocIdSetIterator iterator) gave us these gains?
>> Cool!
>>
>> Yes, it seems like it, based on this benchmark.
>>
>> > I am running Amazon Product Search's benchmarks to see if the change
>> is needle moving for us.
>>
>> Thank you, much appreciated!
>>
>> > Small suggestion on the blog...
>>
>> Thank you for the feedback! The post is definitely a bit confusing, I
>> struggled with keeping it clear. I will try to make some edits to make it
>> clearer what conclusions can be made after each section.
>>
>> /Anton
>>
>> On Sat, 2 Mar 2024 at 00:30, Gautam Worah  wrote:
>>
>>> Hi Anton,
>>>
>>> It took me a while to get through the blog post, and I suspect I will
>>> need to read through it a couple more times to understand it fully.
>>> Thanks for writing up something so detailed. I learnt a lot! (especially
>>> about JVM inlining methods).
>>>
>>> > I have not been able to reproduce the speedup with luceneutil - I
>>> suspect that the default tasks in it would not trigger this code path that
>>> much.
>>>
>>> Yeah, it seems like luceneutil is not stressing the code path that
>>> ElasticSearch's benchmarks are?
>>>
>>> > I tried changing DocIdsWriter::readInts32 (and readDelta16) to instead
>>> call the IntersectVisitor with a DocIdSetIterator, to reduce the
>>> number of virtual calls. In the benchmark setup by Elastic [2] I saw a
>>> decrease of execution time of 35-45% for range queries and numerical
>>> sorting with this patch applied.
>>>
>>> So it seems like switching over from an iterative visit(int docID) call
>>> to a bulk visit(DocIdSetIterator iterator) gave us these gains? Cool!
>>>
>>> I am running Amazon Product Search's benchmarks to see if the change is
>>> needle moving for us.
>>>
>>> Small suggestion on the blog: the JVM inlining and ElasticSearch
>>> short-circuiting/opto causing a difference in performance could've been a
>>> blog post of its own, part 1 maybe... I got confused when the blog shifted from
>>> the performance differences between ElasticSearch and OpenSearch to how
>>> you ended up improving Lucene.
>>>
>>> Regards,
>>> Gautam Worah.
>>>
>>>
>>> On Fri, Mar 1, 2024 at 2:42 AM Anton Hägerstrand 
>>> wrote:
>>>
 Hi everyone, long time lurker here.

 I recently investigated Elasticsearch/OpenSearch performance in a blog
 post [1], and saw some interesting behavior of numerical range queries and
 numerical sorting with regards to inlining and virtual calls.

 In short, the DocIdsWriter::readInts method seems to get much slower if
 it is called with 3 or more implementations of IntersectVisitor during the
 JVM lifetime. I believe that this is due to IntersectVisitor.visit(docid)
 being heavily inlined with 2 or fewer IntersectVisitor implementations,
 while becoming a virtual call with 3 or more.

 This leads to two interesting points wrt Lucene:

 1) For benchmarks, warm-ups should not only be done to trigger JIT
 speedups, but to put the JVM in a realistic production state. For
 the BKDPointTree, this means at least 3 implementations of the
 IntersectVisitor. I'm not sure if this is top of mind when writing Lucene
 benchmarks?
 2) I tried changing DocIdsWriter::readInts32 (and readDelta16) to instead
 call the IntersectVisitor with a DocIdSetIterator, to reduce the

Re: Inlining, virtual calls and BKDPointsTree

2024-03-06 Thread Anton Hägerstrand
> I tried this, and also benchmarked the change on 2 other types of
indexes with slightly varying attributes. They roughly correspond to
indexes for different categories of products.
> Performance on both throughput and latency was flat.

Thank you very much for running the benchmarks and reviewing the code!

After thinking a bit about this, I think it would be best if the PR
could be proven to improve performance in luceneutil before merging. Since
luceneutil does not currently, as far as I understand things, have good
coverage for point range queries and numerical sorting, this means adding
that functionality to luceneutil. I'll start looking into that next week
(I'm currently travelling).
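
As an aside, for anyone who wants to observe the inlining behaviour
locally: HotSpot's JIT decisions can be dumped with its diagnostic flags
and grepped for the visit call site. A rough sketch - BenchMain is a
hypothetical driver that exercises the BKD tree, not something in the repo:

  $ # BenchMain is a placeholder; substitute your own benchmark main class
  $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining BenchMain 2>&1 \
      | grep -i 'IntersectVisitor'

With 2 or fewer IntersectVisitor implementations loaded, the visit call
should show up as inlined; with 3 or more, it should appear as a virtual
(megamorphic) call.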

best regards,
Anton

On Thu, 7 Mar 2024 at 02:30, Gautam Worah  wrote:

> > I'll try tweaking the query set to target queries with more Point hits
> during the week and see what comes out..
>
> I tried this, and also benchmarked the change on 2 other types of
> indexes with slightly varying attributes. They roughly correspond to
> indexes for different categories of products.
> Performance on both throughput and latency was flat.
>
> The change still LGTM.
>
> Regards,
> Gautam Worah.
>
>
> On Sat, Mar 2, 2024 at 8:45 AM Gautam Worah 
> wrote:
>
>> > I am running Amazon Product Search's benchmarks to see if the change
>> is needle moving for us.
>>
>> Results were flat to slightly positive (+0.94% redline QPS) on our
>> workload.
>> Although we do have numeric range queries that would've improved, I
>> suspect it is flat because our workload is heavily dominated by TermQueries
>> and their combinations with various clauses.
>>
>> I'll try tweaking the query set to target queries with more Point hits
>> during the week and see what comes out..
>>
>> Regards,
>> Gautam Worah.
>>
>>
>> On Sat, Mar 2, 2024 at 2:33 AM Anton Hägerstrand 
>> wrote:
>>
>>> Thank you Gautam!
>>>
>>> > Yeah, it seems like luceneutil is not stressing the code path that
>>> ElasticSearch's benchmarks are?
>>>
>>> Yes, as far as I understand it - though it might just be that I don't
>>> understand luceneutil well enough. I believe that in order to see the
>>> performance diff, numerical range queries or numerical sorting would have to
>>> be involved - the more documents matched, the larger the difference. This is
>>> what the relevant benchmark operations from Elastic do.
>>>
>>> > So it seems like switching over from an iterative visit(int docID)
>>> call to a bulk visit(DocIdSetIterator iterator) gave us these gains?
>>> Cool!
>>>
>>> Yes, it seems like it, based on this benchmark.
>>>
>>> > I am running Amazon Product Search's benchmarks to see if the change
>>> is needle moving for us.
>>>
>>> Thank you, much appreciated!
>>>
>>> > Small suggestion on the blog...
>>>
>>> Thank you for the feedback! The post is definitely a bit confusing, I
>>> struggled with keeping it clear. I will try to make some edits to make it
>>> clearer what conclusions can be made after each section.
>>>
>>> /Anton
>>>
>>> On Sat, 2 Mar 2024 at 00:30, Gautam Worah 
>>> wrote:
>>>
 Hi Anton,

 It took me a while to get through the blog post, and I suspect I will
 need to read through it a couple more times to understand it fully.
 Thanks for writing up something so detailed. I learnt a lot!
 (especially about JVM inlining methods).

 > I have not been able to reproduce the speedup with luceneutil - I
 suspect that the default tasks in it would not trigger this code path that
 much.

 Yeah, it seems like luceneutil is not stressing the code path that
 ElasticSearch's benchmarks are?

 > I tried changing DocIdsWriter::readInts32 (and readDelta16) to instead
 call the IntersectVisitor with a DocIdSetIterator, to reduce the
 number of virtual calls. In the benchmark setup by Elastic [2] I saw a
 decrease of execution time of 35-45% for range queries and numerical
 sorting with this patch applied.

 So it seems like switching over from an iterative visit(int docID)
 call to a bulk visit(DocIdSetIterator iterator) gave us these gains?
 Cool!

 I am running Amazon Product Search's benchmarks to see if the change is
 needle moving for us.

 Small suggestion on the blog: the JVM inlining and ElasticSearch
 short-circuiting/opto causing a difference in performance could've been a
 blog post of its own, part 1 maybe... I got confused when the blog shifted from
 the performance differences between ElasticSearch and OpenSearch to how
 you ended up improving Lucene.

 Regards,
 Gautam Worah.


 On Fri, Mar 1, 2024 at 2:42 AM Anton Hägerstrand 
 wrote:

> Hi everyone, long time lurker here.
>
> I recently investigated Elasticsearch/OpenSearch performance in a blog
> post [1], and saw some interesting behavior of numerical range queries and
> numerical sorting with regards to inlining and virtual calls.