Re: Blunders profiling for Searching is broken

2023-04-04 Thread Michael McCandless
Actually, I spoke too soon.  The NIGHTLY_LOG_DIR is indeed a bit different
-- this is where the nightly benchy writes/reads all past nightly results,
generates charts from them, etc.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 4, 2023 at 11:50 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Tue, Apr 4, 2023 at 11:14 AM Anton Hägerstrand 
> wrote:
>
>> Thanks Mike - looks promising! I will have a look at the next upload.
>>
>
> Yay.
>
>> Out of curiosity - what is the reason for using constants.NIGHTLY_LOG_DIR
>> instead of constants.LOGS_DIR, which seems to be what competition.py uses
>> to write the files? Is it that they are always the same value for the
>> nightly builds anyway?
>>
>
> Hmm good question ;)  It really should be constants.LOGS_DIR -- that is
> indeed more general.  I'll fix.  (It is indeed the same value for nightly
> benchy).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>>


Re: Blunders profiling for Searching is broken

2023-04-04 Thread Michael McCandless
On Tue, Apr 4, 2023 at 11:14 AM Anton Hägerstrand  wrote:

> Thanks Mike - looks promising! I will have a look at the next upload.
>

Yay.

> Out of curiosity - what is the reason for using constants.NIGHTLY_LOG_DIR
> instead of constants.LOGS_DIR, which seems to be what competition.py uses
> to write the files? Is it that they are always the same value for the
> nightly builds anyway?
>

Hmm good question ;)  It really should be constants.LOGS_DIR -- that is
indeed more general.  I'll fix.  (It is indeed the same value for nightly
benchy).

Mike McCandless

http://blog.mikemccandless.com

>


Re: Blunders profiling for Searching is broken

2023-04-04 Thread Anton Hägerstrand
Thanks Mike - looks promising! I will have a look at the next upload.

Out of curiosity - what is the reason for using constants.NIGHTLY_LOG_DIR
instead of constants.LOGS_DIR, which seems to be what competition.py uses
to write the files? Is it that they are always the same value for the
nightly builds anyway?

thanks,
Anton

On Tue, 4 Apr 2023 at 16:48, Michael McCandless 
wrote:

> OK I attempted a fix:
> https://github.com/mikemccand/luceneutil/commit/2c8ccdf53e93622761a545c1a54377514c338caa
>
> I think this broke at some point when we moved where the JFR files are
> written...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Apr 4, 2023 at 10:37 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hmm, I'll try to figure out why the nightly benchy is uploading such
>> degenerate JFR zip files!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sat, Mar 18, 2023 at 5:16 AM Anton Hägerstrand 
>> wrote:
>>
>>> I've had a look now - it seems like the .jfr.gz files uploaded for these
>>> search benchmarks are empty (they are well-formed gzip files; there is just
>>> no compressed data in them). Creation of these .jfr.gz files happens as
>>> part of the benchmark setup, where I don't (to my knowledge) have access to
>>> any logs.
>>>
>>> Mike McCandless is probably the person best suited to dig into this, but
>>> I'm here for any questions and will happily help debug the issue as much as
>>> I can.
>>>
>>> Blunders should also do a better job of saying "no data found" instead
>>> of throwing an error. I will look into this.
>>>
>>> Thank you
>>> Anton
>>>
>>> On Fri, 17 Mar 2023 at 20:17, Anton Hägerstrand 
>>> wrote:
>>>
 Hi! Anton from Blunders here.

 I will take a look as soon as possible, most likely I will be able to
 tell what's going on from server logs. Thank you for reporting - I will put
 up better monitoring in the future.

 /Anton


 On Fri, 17 Mar 2023, 19:49 Marc D'Mello,  wrote:

> Hi all,
>
> I was looking at some of the profiles on Blunders (which is linked
> from the nightly benchmarking site:
> https://home.apache.org/~mikemccand/lucenebench/) and it seems like
> some of the latest Searching profiles are not working. For example:
> https://blunders.io/jfr-demo/searching-2023.03.16.18.02.48/jvm_info.
> The indexing profiles seem to be working fine as far as I can tell, so I
> wonder if this is a problem with how the nightly benchmarks are
> publishing data to the Blunders API.
>
> Thanks,
> Marc
>



Re: Blunders profiling for Searching is broken

2023-04-04 Thread Michael McCandless
OK I attempted a fix:
https://github.com/mikemccand/luceneutil/commit/2c8ccdf53e93622761a545c1a54377514c338caa

I think this broke at some point when we moved where the JFR files are
written...

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 4, 2023 at 10:37 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm, I'll try to figure out why the nightly benchy is uploading such
> degenerate JFR zip files!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sat, Mar 18, 2023 at 5:16 AM Anton Hägerstrand 
> wrote:
>
>> I've had a look now - it seems like the .jfr.gz files uploaded for these
>> search benchmarks are empty (they are well-formed gzip files; there is just
>> no compressed data in them). Creation of these .jfr.gz files happens as
>> part of the benchmark setup, where I don't (to my knowledge) have access to
>> any logs.
>>
>> Mike McCandless is probably the person best suited to dig into this, but
>> I'm here for any questions and will happily help debug the issue as much as
>> I can.
>>
>> Blunders should also do a better job of saying "no data found" instead of
>> throwing an error. I will look into this.
>>
>> Thank you
>> Anton
>>
>> On Fri, 17 Mar 2023 at 20:17, Anton Hägerstrand 
>> wrote:
>>
>>> Hi! Anton from Blunders here.
>>>
>>> I will take a look as soon as possible, most likely I will be able to
>>> tell what's going on from server logs. Thank you for reporting - I will put
>>> up better monitoring in the future.
>>>
>>> /Anton
>>>
>>>
>>> On Fri, 17 Mar 2023, 19:49 Marc D'Mello,  wrote:
>>>
 Hi all,

 I was looking at some of the profiles on Blunders (which is linked from
 the nightly benchmarking site:
 https://home.apache.org/~mikemccand/lucenebench/) and it seems like
 some of the latest Searching profiles are not working. For example:
 https://blunders.io/jfr-demo/searching-2023.03.16.18.02.48/jvm_info.
 The indexing profiles seem to be working fine as far as I can tell, so I
 wonder if this is a problem with how the nightly benchmarks are
 publishing data to the Blunders API.

 Thanks,
 Marc

>>>


Re: Blunders profiling for Searching is broken

2023-04-04 Thread Michael McCandless
Hmm, I'll try to figure out why the nightly benchy is uploading such
degenerate JFR zip files!
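
For anyone who wants to reproduce the symptom, here is a minimal standalone
sketch (plain java.util.zip, not part of the benchmark scripts; the input path
is just a placeholder) that distinguishes a well-formed-but-empty .jfr.gz from
one that actually contains recording data:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class CheckJfrGz {
  public static void main(String[] args) throws Exception {
    Path jfrGz = Path.of(args[0]);  // e.g. a downloaded nightly .jfr.gz upload
    long uncompressed = 0;
    byte[] buf = new byte[8192];
    // A "degenerate" upload is valid gzip that decompresses to zero bytes.
    try (InputStream in = new GZIPInputStream(Files.newInputStream(jfrGz))) {
      int n;
      while ((n = in.read(buf)) != -1) {
        uncompressed += n;
      }
    }
    System.out.println(jfrGz + (uncompressed == 0
        ? ": well-formed gzip, but no JFR data inside"
        : ": decompresses to " + uncompressed + " bytes"));
  }
}

An empty result matches what Anton describes below: the gzip container itself
is fine, there is simply nothing in it.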

Mike McCandless

http://blog.mikemccandless.com


On Sat, Mar 18, 2023 at 5:16 AM Anton Hägerstrand  wrote:

> I've had a look now - it seems like the .jfr.gz files uploaded for these
> search benchmarks are empty (they are well-formed gzip files; there is just
> no compressed data in them). Creation of these .jfr.gz files happens as
> part of the benchmark setup, where I don't (to my knowledge) have access to
> any logs.
>
> Mike McCandless is probably the person best suited to dig into this, but
> I'm here for any questions and will happily help debug the issue as much as
> I can.
>
> Blunders should also do a better job of saying "no data found" instead of
> throwing an error. I will look into this.
>
> Thank you
> Anton
>
> On Fri, 17 Mar 2023 at 20:17, Anton Hägerstrand  wrote:
>
>> Hi! Anton from Blunders here.
>>
>> I will take a look as soon as possible, most likely I will be able to
>> tell what's going on from server logs. Thank you for reporting - I will put
>> up better monitoring in the future.
>>
>> /Anton
>>
>>
>> On Fri, 17 Mar 2023, 19:49 Marc D'Mello,  wrote:
>>
>>> Hi all,
>>>
>>> I was looking at some of the profiles on Blunders (which is linked from
>>> the nightly benchmarking site:
>>> https://home.apache.org/~mikemccand/lucenebench/) and it seems like
>>> some of the latest Searching profiles are not working. For example:
>>> https://blunders.io/jfr-demo/searching-2023.03.16.18.02.48/jvm_info.
>>> The indexing profiles seem to be working fine as far as I can tell, so I
>>> wonder if this is a problem with how the nightly benchmarks are
>>> publishing data to the Blunders API.
>>>
>>> Thanks,
>>> Marc
>>>
>>


Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-04 Thread Michael Wechner
IIUC we all agree that the limit could be raised, but we need some solid
reasoning about what limit makes sense, or rather why we would set this
particular limit (e.g. 2048), right?


Thanks

Michael


On 04.04.23 at 15:32, Michael McCandless wrote:
> I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modify it anytime someone requires a higher
> limit.


The problem with this approach is it is a one-way door, once 
released.  We would not be able to lower the limit again in the future 
without possibly breaking some applications.


> For example, we don't limit the number of docs per index to an
> arbitrary maximum of N, you push how many docs you like and if they
> are too much for your system, you get terrible
> performance/crashes/whatever.


Correction: we do check this limit and throw a specific exception now: 
https://github.com/apache/lucene/issues/6905


+1 to raise the limit, but not remove it.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti 
 wrote:


... and what would be the next limit?
I guess we'll need to motivate it better than the 1024 one.
I appreciate the fact that a limit is pretty much wanted by
everyone but I suspect we'll need some solid foundation for
deciding the amount (and it should be high enough to avoid
continuous changes)

Cheers

On Sun, 2 Apr 2023, 07:29 Michael Wechner,
 wrote:

btw, what was the reasoning to set the current limit to 1024?

Thanks

Michael

On 01.04.23 at 14:47, Michael Sokolov wrote:

I'm also in favor of raising this limit. We do see some
datasets with higher than 1024 dims. I also think we need to
keep a limit. For example we currently need to keep all the
vectors in RAM while indexing and we want to be able to
support reasonable numbers of vectors in an index segment.
Also we don't know what innovations might come down the road.
Maybe someday we want to do product quantization and enforce
that (k, m) both fit in a byte -- we wouldn't be able to do
that if a vector's dimension were to exceed 32K.

On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti
 wrote:

I am also curious what would be the worst-case scenario
if we remove the constant altogether (so automatically the
limit becomes the Java Integer.MAX_VALUE).
i.e.
right now if you exceed the limit you get:

if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
  throw new IllegalArgumentException(
      "cannot index vectors with dimension greater than " + ByteVectorValues.MAX_DIMENSIONS);
}


in relation to:

These limits allow us to
better tune our data structures, prevent overflows,
help ensure we
have good test coverage, etc.

I agree 100% especially for typing stuff properly and
avoiding resource waste here and there, but I am not
entirely sure this is the case for the current
implementation i.e. do we have optimizations in place
that assume the max dimension to be 1024?
If I missed that (and I likely have), I of course suggest
the contribution should not just blindly remove the
limit, but do it appropriately.
I am not in favor of just doubling it as suggested by
some people, I would ideally prefer a solution that
remains there to a decent extent, rather than having to
modify it anytime someone requires a higher limit.

Cheers
--
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io
LinkedIn | Twitter | Youtube | Github


On Fri, 31 Mar 2023 at 16:12, Michael Wechner
 wrote:

OpenAI reduced their size to 1536 dimensions

https://openai.com/blog/new-and-improved-embedding-model

so 2048 would work :-)

but other services also provide higher dimensions, sometimes
with slightly better accuracy

Thanks

Michael


Am 31.03.23 um 

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-04 Thread Shai Erera
I am not familiar with the internal implementation details, but is it
possible to refactor the code such that someone can provide an extension of
some VectorEncoder/Decoder and control the limits on their side? Rather
than Lucene committing to some arbitrary limit (which these days seems to
keep growing)?

If raising the limit only means changing some hard-coded constant, then I
assume such an abstraction can work. We can mark this extension as
@lucene.expert.
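
To make the idea concrete, here is a rough and purely hypothetical sketch --
none of these types exist in Lucene today, and the names (VectorValuesEncoder,
maxDimensions) are invented for illustration only -- of the kind of extension
point being suggested, where the pluggable encoder, rather than a global
constant, decides the maximum dimension it accepts:

import java.nio.ByteBuffer;

/** Hypothetical extension point: the encoder, not a hard-coded constant, owns the limit. */
interface VectorValuesEncoder {
  /** Maximum vector dimension this encoder is willing to accept. */
  int maxDimensions();

  /** Encodes one vector for storage, rejecting anything above the encoder's own limit. */
  byte[] encode(float[] vector);
}

/** Sketch of a default implementation that keeps today's limit of 1024. */
final class DefaultVectorValuesEncoder implements VectorValuesEncoder {
  @Override
  public int maxDimensions() {
    return 1024;
  }

  @Override
  public byte[] encode(float[] vector) {
    if (vector.length > maxDimensions()) {
      throw new IllegalArgumentException(
          "cannot index vectors with dimension greater than " + maxDimensions());
    }
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) {
      buf.putFloat(v);
    }
    return buf.array();
  }
}

An expert user who needs, say, 2048 dimensions could then plug in their own
implementation instead of waiting for the shared constant to change.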

Shai


On Tue, Apr 4, 2023 at 4:33 PM Michael McCandless 
wrote:

> > I am not in favor of just doubling it as suggested by some people, I
> > would ideally prefer a solution that remains there to a decent extent,
> > rather than having to modify it anytime someone requires a higher limit.
>
> The problem with this approach is it is a one-way door, once released.  We
> would not be able to lower the limit again in the future without possibly
> breaking some applications.
>
> > For example, we don't limit the number of docs per index to an
> > arbitrary maximum of N, you push how many docs you like and if they are too
> > much for your system, you get terrible performance/crashes/whatever.
>
> Correction: we do check this limit and throw a specific exception now:
> https://github.com/apache/lucene/issues/6905
>
> +1 to raise the limit, but not remove it.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti 
> wrote:
>
>> ... and what would be the next limit?
>> I guess we'll need to motivate it better than the 1024 one.
>> I appreciate the fact that a limit is pretty much wanted by everyone but
>> I suspect we'll need some solid foundation for deciding the amount (and it
>> should be high enough to avoid continuous changes)
>>
>> Cheers
>>
>> On Sun, 2 Apr 2023, 07:29 Michael Wechner, 
>> wrote:
>>
>>> btw, what was the reasoning to set the current limit to 1024?
>>>
>>> Thanks
>>>
>>> Michael
>>>
>>> On 01.04.23 at 14:47, Michael Sokolov wrote:
>>>
>>> I'm also in favor of raising this limit. We do see some datasets with
>>> higher than 1024 dims. I also think we need to keep a limit. For example we
>>> currently need to keep all the vectors in RAM while indexing and we want to
>>> be able to support reasonable numbers of vectors in an index segment. Also
>>> we don't know what innovations might come down the road. Maybe someday we
>>> want to do product quantization and enforce that (k, m) both fit in a byte
>>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>>> 32K.
>>>
>>> On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>>
 I am also curious what would be the worst-case scenario if we remove
 the constant altogether (so automatically the limit becomes the Java
 Integer.MAX_VALUE).
 i.e.
 right now if you exceed the limit you get:

> if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
>   throw new IllegalArgumentException(
>       "cannot index vectors with dimension greater than " + ByteVectorValues.MAX_DIMENSIONS);
> }


 in relation to:

> These limits allow us to
> better tune our data structures, prevent overflows, help ensure we
> have good test coverage, etc.


 I agree 100% especially for typing stuff properly and avoiding resource
 waste here and there, but I am not entirely sure this is the case for the
 current implementation i.e. do we have optimizations in place that assume
 the max dimension to be 1024?
 If I missed that (and I likely have), I of course suggest the
 contribution should not just blindly remove the limit, but do it
 appropriately.
 I am not in favor of just doubling it as suggested by some people, I
 would ideally prefer a solution that remains there to a decent extent,
 rather than having to modify it anytime someone requires a higher limit.

 Cheers

 --
 *Alessandro Benedetti*
 Director @ Sease Ltd.
 *Apache Lucene/Solr Committer*
 *Apache Solr PMC Member*

 e-mail: a.benede...@sease.io


 *Sease* - Information Retrieval Applied
 Consulting | Training | Open Source

 Website: Sease.io
 LinkedIn | Twitter | Youtube | Github


 On Fri, 31 Mar 2023 at 16:12, Michael Wechner <
 michael.wech...@wyona.com> wrote:

> OpenAI reduced their size to 1536 dimensions
>
> https://openai.com/blog/new-and-improved-embedding-model
>
> so 2048 would work :-)
>
> but other services also provide higher dimensions, sometimes with
> slightly better accuracy
>
> Thanks
>
> Michael
>
>
> On 31.03.23 at 14:45, Adrien Grand wrote:
> > I'm 

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-04 Thread Michael McCandless
> I am not in favor of just doubling it as suggested by some people, I
> would ideally prefer a solution that remains there to a decent extent,
> rather than having to modify it anytime someone requires a higher limit.

The problem with this approach is it is a one-way door, once released.  We
would not be able to lower the limit again in the future without possibly
breaking some applications.

> For example, we don't limit the number of docs per index to an arbitrary
> maximum of N, you push how many docs you like and if they are too much for
> your system, you get terrible performance/crashes/whatever.

Correction: we do check this limit and throw a specific exception now:
https://github.com/apache/lucene/issues/6905

+1 to raise the limit, but not remove it.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti 
wrote:

> ... and what would be the next limit?
> I guess we'll need to motivate it better than the 1024 one.
> I appreciate the fact that a limit is pretty much wanted by everyone but I
> suspect we'll need some solid foundation for deciding the amount (and it
> should be high enough to avoid continuous changes)
>
> Cheers
>
> On Sun, 2 Apr 2023, 07:29 Michael Wechner, 
> wrote:
>
>> btw, what was the reasoning to set the current limit to 1024?
>>
>> Thanks
>>
>> Michael
>>
>> On 01.04.23 at 14:47, Michael Sokolov wrote:
>>
>> I'm also in favor of raising this limit. We do see some datasets with
>> higher than 1024 dims. I also think we need to keep a limit. For example we
>> currently need to keep all the vectors in RAM while indexing and we want to
>> be able to support reasonable numbers of vectors in an index segment. Also
>> we don't know what innovations might come down the road. Maybe someday we
>> want to do product quantization and enforce that (k, m) both fit in a byte
>> -- we wouldn't be able to do that if a vector's dimension were to exceed
>> 32K.
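
To put rough numbers on the RAM point above (back-of-the-envelope arithmetic
only, assuming 4-byte float components; the dimension values are arbitrary
examples):

public class VectorMemory {
  public static void main(String[] args) {
    // Per-vector memory before any index structures, at a few dimensions.
    int[] dims = {1024, 2048, 16384, Integer.MAX_VALUE};
    for (int d : dims) {
      long bytes = (long) d * Float.BYTES;  // 4 bytes per float component
      System.out.printf("dim=%,d -> %,d bytes (~%.1f MB) per vector%n",
          d, bytes, bytes / (1024.0 * 1024.0));
    }
  }
}

A single Integer.MAX_VALUE-dimension float vector is on the order of 8 GB
before any index structures, which is the kind of degenerate case a limit
guards against.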
>>
>> On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> I am also curious what would be the worst-case scenario if we remove the
>>> constant altogether (so automatically the limit becomes the Java
>>> Integer.MAX_VALUE).
>>> i.e.
>>> right now if you exceed the limit you get:
>>>
 if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
   throw new IllegalArgumentException(
       "cannot index vectors with dimension greater than " + ByteVectorValues.MAX_DIMENSIONS);
 }
>>>
>>>
>>> in relation to:
>>>
 These limits allow us to
 better tune our data structures, prevent overflows, help ensure we
 have good test coverage, etc.
>>>
>>>
>>> I agree 100% especially for typing stuff properly and avoiding resource
>>> waste here and there, but I am not entirely sure this is the case for the
>>> current implementation i.e. do we have optimizations in place that assume
>>> the max dimension to be 1024?
>>> If I missed that (and I likely have), I of course suggest the
>>> contribution should not just blindly remove the limit, but do it
>>> appropriately.
>>> I am not in favor of just doubling it as suggested by some people, I
>>> would ideally prefer a solution that remains there to a decent extent,
>>> rather than having to modify it anytime someone requires a higher limit.
>>>
>>> Cheers
>>>
>>> --
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io
>>> LinkedIn | Twitter | Youtube | Github
>>>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 16:12, Michael Wechner 
>>> wrote:
>>>
 OpenAI reduced their size to 1536 dimensions

 https://openai.com/blog/new-and-improved-embedding-model

 so 2048 would work :-)

 but other services also provide higher dimensions, sometimes with
 slightly better accuracy

 Thanks

 Michael


 On 31.03.23 at 14:45, Adrien Grand wrote:
 > I'm supportive of bumping the limit on the maximum dimension for
 > vectors to something that is above what the majority of users need,
 > but I'd like to keep a limit. We have limits for other things like the
 > max number of docs per index, the max term length, the max number of
 > dimensions of points, etc. and there are a few things that we don't
 > have limits on that I wish we had limits on. These limits allow us to
 > better tune our data structures, prevent overflows, help ensure we
 > have good test coverage, etc.
 >
 > That said, these other limits we have in place are quite high. E.g.
 > the 32kB term limit, nobody would ever type a 32kB term in a