Re: exposing per-field storage usage

2022-06-13 Thread Nhat Nguyen
> Also, the tool can be much more efficient than checkindex, e.g. for
> stored fields and vectors it can just retrieve the first and last
> documents, whereas checkindex should verify all of the documents
> slowly.


Yes, we implemented a similar heuristic in the DiskUsage API in
Elasticsearch.

On Mon, Jun 13, 2022 at 11:27 PM Robert Muir  wrote:

> On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
>  wrote:
> >
> > Hi Michael,
> >
> > We developed similar functionality in Elasticsearch. The DiskUsage API
> > estimates the storage of each field by iterating its structures (i.e.,
> > inverted index, doc-values, stored fields, etc.) and tracking the number
> > of bytes read. The result is pretty fast and accurate.
> >
> > I am +1 to the proposal.
> >
>
> I like an approach such as this: enumerate the index, using something
> like FilterDirectory to track the bytes. It doesn't require you to
> force-merge all the data through addIndexes, and at the same time it
> doesn't invade the codec APIs.
> The user can always force-merge the data themselves for situations
> such as benchmarks/tracking space over time; otherwise the
> fluctuations from merges could create too much noise.
> Personally, I would suggest a separate API/tool from CheckIndex; perhaps
> this tracking could mask bugs? No reason to mix the two concerns.
> Also, the tool can be much more efficient than checkindex, e.g. for
> stored fields and vectors it can just retrieve the first and last
> documents, whereas checkindex should verify all of the documents
> slowly.


Re: exposing per-field storage usage

2022-06-13 Thread Robert Muir
On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
 wrote:
>
> Hi Michael,
>
> We developed similar functionality in Elasticsearch. The DiskUsage API
> estimates the storage of each field by iterating its structures (i.e.,
> inverted index, doc-values, stored fields, etc.) and tracking the number of
> bytes read. The result is pretty fast and accurate.
>
> I am +1 to the proposal.
>

I like an approach such as this: enumerate the index, using something
like FilterDirectory to track the bytes. It doesn't require you to
force-merge all the data through addIndexes, and at the same time it
doesn't invade the codec APIs.
The user can always force-merge the data themselves for situations
such as benchmarks/tracking space over time; otherwise the
fluctuations from merges could create too much noise.
Personally, I would suggest a separate API/tool from CheckIndex; perhaps
this tracking could mask bugs? No reason to mix the two concerns.
Also, the tool can be much more efficient than checkindex, e.g. for
stored fields and vectors it can just retrieve the first and last
documents, whereas checkindex should verify all of the documents
slowly.
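
A minimal sketch of the FilterDirectory-based tracking described above,
assuming Lucene 9.x APIs (TrackingReadBytesDirectory and its inner class are
illustrative names, not existing Lucene classes):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Counts every byte read through this directory. Enumerate one field's
// postings (or doc values) between two reads of the counter to attribute
// the bytes to that field.
final class TrackingReadBytesDirectory extends FilterDirectory {
  final AtomicLong bytesRead = new AtomicLong();

  TrackingReadBytesDirectory(Directory in) {
    super(in);
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    return new TrackingIndexInput(super.openInput(name, context), bytesRead);
  }

  // Wraps an IndexInput and adds the size of every read to the shared counter.
  private static final class TrackingIndexInput extends IndexInput {
    private final IndexInput in;
    private final AtomicLong counter;

    TrackingIndexInput(IndexInput in, AtomicLong counter) {
      super("tracking(" + in + ")");
      this.in = in;
      this.counter = counter;
    }

    @Override
    public byte readByte() throws IOException {
      counter.incrementAndGet();
      return in.readByte();
    }

    @Override
    public void readBytes(byte[] b, int offset, int len) throws IOException {
      counter.addAndGet(len);
      in.readBytes(b, offset, len);
    }

    @Override
    public IndexInput slice(String desc, long offset, long length) throws IOException {
      return new TrackingIndexInput(in.slice(desc, offset, length), counter);
    }

    @Override
    public IndexInput clone() {
      return new TrackingIndexInput(in.clone(), counter);
    }

    @Override public void close() throws IOException { in.close(); }
    @Override public long getFilePointer() { return in.getFilePointer(); }
    @Override public void seek(long pos) throws IOException { in.seek(pos); }
    @Override public long length() { return in.length(); }
  }
}

To attribute bytes to a field, open a DirectoryReader over the tracking
directory, snapshot the counter, exhaust that field's TermsEnum/PostingsEnum
(or doc-values iterator), and snapshot it again; the delta approximates the
field's share of the inverted index. For stored fields and term vectors, the
first/last-document shortcut described above avoids decoding every document.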




Re: [VOTE] Release Lucene/Solr 8.11.2 RC2

2022-06-13 Thread Tomás Fernández Löbbe
+1

SUCCESS! [1:02:16.559513]

On Mon, Jun 13, 2022 at 12:07 PM Mike Drob  wrote:

> Please vote for release candidate 2 for Lucene/Solr 8.11.2
>
> The artifacts can be downloaded from:
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.11.2-RC2-rev17dee71932c683e345508113523e764c3e4c80fa
>
> You can run the smoke tester directly with this command:
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.11.2-RC2-rev17dee71932c683e345508113523e764c3e4c80fa
>
> The vote will be open for at least 72 hours, i.e. until 2022-06-16 20:00 UTC
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1
> 
>


Re: exposing per-field storage usage

2022-06-13 Thread Atri Sharma
+1

Will really help with visibility.

On Tue, 14 Jun 2022, 00:56 Nhat Nguyen, 
wrote:

> Hi Michael,
>
> We developed similar functionality in Elasticsearch. The DiskUsage API
> estimates the storage of each field by iterating its structures (i.e.,
> inverted index, doc-values, stored fields, etc.) and tracking the number of
> bytes read. The result is pretty fast and accurate.
>
> I am +1 to the proposal.
>
> Thanks,
> Nhat
>
> On Mon, Jun 13, 2022 at 1:22 PM Michael Sokolov 
> wrote:
>
>> At Amazon, we have a need to produce regular metrics on how much disk
>> storage is consumed by each field. We manage an index with data
>> contributed by many teams and business units and we are often asked to
>> produce reports attributing index storage usage to these customers.
>> The best tool we have for this today is based on a custom Codec that
>> separates storage by field; to get the statistics we read an existing
>> index and write it out using AddIndexes and force-merging, using the
>> custom codec. This is time-consuming and inefficient and tends not to
>> get done.
>>
>> I wonder if it would make sense to add methods to *some* API that
>> would expose a per-field disk space metric? If we don't want to add to
>> IndexReader, which would imply lots of intermediate methods and API
>> additions, maybe we could have it computed by CheckIndex?
>>
>> (implementation note: For the current formats, I think the information
>> is always segregated by field. I suppose that in
>> theory we might want to have some shared data structure across fields
>> some day, but it seems like an edge case that we could handle in some
>> exceptional way.)


Re: exposing per-field storage usage

2022-06-13 Thread Nhat Nguyen
Hi Michael,

We developed similar functionality in Elasticsearch. The DiskUsage API
estimates the storage of each field by iterating its structures (i.e.,
inverted index, doc-values, stored fields, etc.) and tracking the number of
bytes read. The result is pretty fast and accurate.

I am +1 to the proposal.

Thanks,
Nhat

On Mon, Jun 13, 2022 at 1:22 PM Michael Sokolov  wrote:

> At Amazon, we have a need to produce regular metrics on how much disk
> storage is consumed by each field. We manage an index with data
> contributed by many teams and business units and we are often asked to
> produce reports attributing index storage usage to these customers.
> The best tool we have for this today is based on a custom Codec that
> separates storage by field; to get the statistics we read an existing
> index and write it out using AddIndexes and force-merging, using the
> custom codec. This is time-consuming and inefficient and tends not to
> get done.
>
> I wonder if it would make sense to add methods to *some* API that
> would expose a per-field disk space metric? If we don't want to add to
> IndexReader, which would imply lots of intermediate methods and API
> additions, maybe we could have it computed by CheckIndex?
>
> (implementation note: For the current formats, I think the information
> is always segregated by field. I suppose that in
> theory we might want to have some shared data structure across fields
> some day, but it seems like an edge case that we could handle in some
> exceptional way.)


[VOTE] Release Lucene/Solr 8.11.2 RC2

2022-06-13 Thread Mike Drob
Please vote for release candidate 2 for Lucene/Solr 8.11.2

The artifacts can be downloaded from:
https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.11.2-RC2-rev17dee71932c683e345508113523e764c3e4c80fa

You can run the smoke tester directly with this command:

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.11.2-RC2-rev17dee71932c683e345508113523e764c3e4c80fa

The vote will be open for at least 72 hours, i.e. until 2022-06-16 20:00 UTC

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1



Re: [VOTE] Release Lucene/Solr 8.11.2 RC1

2022-06-13 Thread Mike Drob
This RC did not receive enough votes to pass; I've fixed the bug pointed
out by Houston and will be moving on to RC2. Thanks!

On Sun, Jun 12, 2022 at 2:57 PM Mike Drob  wrote:

> Thanks for finding that, Houston! It was an issue during backporting that
> I've corrected. I'll respin and put up a new RC with the fix.
>
> On Sat, Jun 11, 2022 at 11:21 AM Houston Putman 
> wrote:
>
>> +0
>>
>> SUCCESS! [1:02:38.547629]
>>
>> I saw this in the example logs during the smoketester:
>>
>>> ps: Invalid process id: i��\r\001
>>> Waiting up to 180 seconds to see Solr running on port 8983 [/]
>>> Started Solr server on port 8983 (pid=16758). Happy searching!
>>>
>> This seems related to SOLR-16191, which is included in this release.
>> Not exactly sure what went wrong, but the example still passed?
>>
>> - Houston
>>
>> On Wed, Jun 8, 2022 at 8:50 PM Mike Drob  wrote:
>>
>>> to: dev@lucene, dev@solr
>>>
>>> Please vote for release candidate 1 for Lucene/Solr 8.11.2
>>>
>>> The artifacts can be downloaded from:
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.11.2-RC1-reva9ed1e5fccbd1a84c78194a1329a7e1a3032ffc6
>>>
>>> You can run the smoke tester directly with this command:
>>>
>>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.11.2-RC1-reva9ed1e5fccbd1a84c78194a1329a7e1a3032ffc6
>>>
>>> Please see draft release notes (and edit as appropriate) at
>>> https://cwiki.apache.org/confluence/display/LUCENE/ReleaseNote8_11_2
>>> https://cwiki.apache.org/confluence/display/SOLR/ReleaseNote8_11_2
>>>
>>> The vote will be open for at least 72 hours, not including the weekend,
>>> i.e. until 2022-06-14 01:00 UTC.
>>>
>>> [ ] +1  approve
>>> [ ] +0  no opinion
>>> [ ] -1  disapprove (and reason why)
>>>
>>> Here is my +1
>>>
>>


exposing per-field storage usage

2022-06-13 Thread Michael Sokolov
At Amazon, we have a need to produce regular metrics on how much disk
storage is consumed by each field. We manage an index with data
contributed by many teams and business units and we are often asked to
produce reports attributing index storage usage to these customers.
The best tool we have for this today is based on a custom Codec that
separates storage by field; to get the statistics we read an existing
index and write it out using AddIndexes and force-merging, using the
custom codec. This is time-consuming and inefficient and tends not to
get done.

I wonder if it would make sense to add methods to *some* API that
would expose a per-field disk space metric? If we don't want to add to
IndexReader, which would imply lots of intermediate methods and API
additions, maybe we could have it computed by CheckIndex?

(implementation note: For the current formats, I think the information
is always segregated by field. I suppose that in
theory we might want to have some shared data structure across fields
some day, but it seems like an edge case that we could handle in some
exceptional way.)
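
A rough sketch of the rewrite flow described above, assuming Lucene 9.x
(PerFieldSeparatingCodec is hypothetical, standing in for the custom
field-separating codec):

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

static void rewriteForSizeReport(Path src, Path dst) throws IOException {
  try (Directory in = FSDirectory.open(src);
       Directory out = FSDirectory.open(dst);
       DirectoryReader reader = DirectoryReader.open(in);
       IndexWriter writer = new IndexWriter(out,
           // PerFieldSeparatingCodec is hypothetical: a codec that writes each
           // field to its own files, so sizes can be read off the directory.
           new IndexWriterConfig().setCodec(new PerFieldSeparatingCodec()))) {
    CodecReader[] leaves = reader.leaves().stream()
        .map(ctx -> (CodecReader) ctx.reader()) // segment readers are CodecReaders
        .toArray(CodecReader[]::new);
    writer.addIndexes(leaves); // rewrite every segment under the custom codec
    writer.forceMerge(1);      // one segment: per-field file sizes are the report
  }
}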




Re: 30% query performance degradation for documents with small stored fields

2022-06-13 Thread Adrien Grand
> Is my understanding correct that changing only the block size and disabling
> preset dictionaries are changes that likely won't require re-indexing
> and could be easily carried over to the next Lucene versions? I understand
> there is no guarantee, but I am curious to hear your opinion because this
> introduces additional risk for us.

This assessment looks correct to me.

On Tue, Jun 7, 2022 at 7:25 PM Alexander Lukyanchikov <
alexanderlukyanchi...@gmail.com> wrote:

> Hi Adrien, Michael
> Thank you, your responses are very helpful.
>
> > We're trying to have sensible defaults for the performance/compression
> > trade-off in the default codec
> Sure, the compression improvement achieved with these changes is amazing
> and the fetch-speed tradeoff makes a lot of sense, since it's likely
> unnoticeable for the general use case of larger stored-field payloads.
>
> > One approach that is supported consists of rewriting indexes to the
> > default codec to perform upgrades using
> > `IndexWriter#addIndexes(CodecReader)`
>
> That indeed could be really useful, although the ability to upgrade from
> the previous Lucene version without re-indexing is very important for us. *Is
> my understanding correct that changing only the block size and disabling
> preset dictionaries are changes that likely won't require re-indexing and
> could be easily carried over to the next Lucene versions? I understand
> there is no guarantee, but I am curious to hear your opinion because this
> introduces additional risk for us.*
>
> > I wonder whether it would be worth trying to switch from stored fields
> > to doc values
>
> Yes, that is something we considered before but discarded due to
> access-pattern specifics and the fact that the payload size can also be
> large in some cases, although in the future we will likely need to use doc
> values for a less generic feature where a small size is guaranteed.
>
> Regards,
> Alex
>
>
> On Tue, Jun 7, 2022 at 5:45 AM Michael Sokolov  wrote:
>
>> I wonder whether it would be worth trying to switch from stored fields
>> to doc values. The access patterns are different, so the change would
>> not be trivial, and I really am not sure whether you would see gains;
>> the storage model is completely different. But if you have a small
>> number of fields, it could be better?
>>
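
To make the stored-fields-vs-doc-values comparison above concrete, a small
sketch of the two storage models (field names and the wrapper class are
illustrative; assuming Lucene 9.x):

import java.io.IOException;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.util.BytesRef;

class PayloadStorageSketch {
  // Index the same payload both ways (in practice you would pick one).
  static Document docWithPayload(byte[] payload) {
    Document doc = new Document();
    doc.add(new StoredField("payload_stored", payload));
    doc.add(new BinaryDocValuesField("payload_dv", new BytesRef(payload)));
    return doc;
  }

  // Stored fields: fetching one value decompresses the document's whole chunk.
  static BytesRef readStored(IndexSearcher searcher, int docID) throws IOException {
    return searcher.doc(docID).getBinaryValue("payload_stored");
  }

  // Doc values: a per-field column, addressed by docID within one leaf.
  static BytesRef readDocValues(LeafReader leaf, int docIDInLeaf) throws IOException {
    BinaryDocValues dv = DocValues.getBinary(leaf, "payload_dv");
    return dv.advanceExact(docIDInLeaf) ? dv.binaryValue() : null;
  }
}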
>> On Tue, Jun 7, 2022 at 3:16 AM Adrien Grand  wrote:
>> >
>> > Hi Alexander,
>> >
>> > Sorry that these changes impacted your workload negatively. We're
>> > trying to have sensible defaults for the performance/compression trade-off
>> > in the default codec, and indeed our guidance is to write a custom codec
>> > when it doesn't work. As you identified, Lucene only guarantees backward
>> > compatibility of file formats for the default codec, so if you write a
>> > custom codec you will have to maintain backward compatibility on your own.
>> >
>> > > Are there any less obvious ways to improve the situation for this use
>> > > case?
>> >
>> > I can't think of other workarounds.
>> >
>> > One approach that is supported consists of rewriting indexes to the
>> > default codec to perform upgrades using
>> > `IndexWriter#addIndexes(CodecReader)`. Say you have a custom codec: you
>> > could rewrite the index to the default codec, then upgrade to a new Lucene
>> > version, and rewrite the index again using your custom codec. This doesn't
>> > remove the maintenance overhead entirely, but it saves you from having to
>> > worry about backward compatibility of file formats.
>> >
>> > > does it make sense to expose related settings so users can tune the
>> > > compression without copying several internal classes?
>> >
>> > Lucene exposes ways to customize stored fields; look at the constructor
>> > of `Lucene90CompressingStoredFieldsFormat` for instance, which allows
>> > configuring block sizes, compression strategies, etc. These classes are
>> > considered internal so the API is not stable, but they could be used to
>> > avoid copying lots of code from Lucene's stored fields format.
>> >
>> > The consensus is that stored fields of the default codec shouldn't
>> > expose more tuning options than BEST_SPEED/BEST_COMPRESSION. This is
>> > already quite a burden in terms of testing and backward compatibility. The
>> > idea of exposing more tuning options has been brought up a few times and
>> > rejected.
>> >
>> > Not directly related to your question, but possibly still of interest
>> > to you:
>> >  - We're now tracking the performance of stored fields on small
>> > documents nightly:
>> > http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
>> >  - If you're seeing a 30% performance degradation with recent changes
>> > to stored fields, there is a good chance that you could improve the
>> > performance of this workload significantly with a custom codec that is
>> > lighter on compression.
>> >
>> >
>> > On Tue, Jun 7, 2022 at 1:32 AM Alexander Lukyanchikov <
>> alexanderlukyanchi...@gmail.com> wrote:
>> >>
>> >> Hello everyone,
>> >> We are in the 
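
To illustrate the customization route described above, a sketch of wiring
`Lucene90CompressingStoredFieldsFormat` into a FilterCodec. The codec name and
tuning values are illustrative, and the constructor is an internal, unstable
API, so the exact signature should be checked against your Lucene version:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.lucene92.Lucene92Codec;

public final class LightStoredFieldsCodec extends FilterCodec {
  // Smaller chunks and fewer docs per chunk mean lighter compression work at
  // the cost of a worse compression ratio; these values are illustrative only.
  private final StoredFieldsFormat storedFields =
      new Lucene90CompressingStoredFieldsFormat(
          "LightStoredFields", CompressionMode.FAST,
          /* chunkSize */ 16 * 1024, /* maxDocsPerChunk */ 128, /* blockShift */ 10);

  public LightStoredFieldsCodec() {
    super("LightStoredFieldsCodec", new Lucene92Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}

Set it with IndexWriterConfig#setCodec at write time, and register the class
via SPI (META-INF/services/org.apache.lucene.codecs.Codec) so the codec name
recorded in the segments can be resolved when the index is opened later.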

JDK 19: Rampdown Phase 1 + EA builds 26 & JDK 20: EA builds 1

2022-06-13 Thread David Delabassee

Greetings!

JDK 19 has now entered Rampdown Phase One (RDP1) [1], which means that
the main line has been forked into a dedicated JDK 19 stabilization
repository. At this point, the overall JDK 19 feature set is frozen and 
no additional JEPs will be targeted to JDK 19. The stabilization 
repository is open for select bug fixes and, with approval, late 
low-risk enhancements per the JDK Release Process [2]. Any change pushed 
to the main line is now bound for JDK 20, unless it is explicitly 
back-ported to JDK 19.


The next few weeks should be used to identify and resolve as
many issues as possible before JDK 19 enters the Release Candidate
phase. Moreover, we encourage you to test your project with the
`--enable-preview` flag as described in this Quality Outreach Heads-up
[3], even if you don't intend to use Virtual Threads in the near future
(a minimal example follows the links below).


[1] https://mail.openjdk.java.net/pipermail/jdk-dev/2022-June/006735.html
[2] https://openjdk.java.net/jeps/3
[3] https://inside.java/2022/05/16/quality-heads-up/
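
For example, a minimal smoke test exercising a preview API under the
`--enable-preview` flag (the class name is illustrative):

// Compile and run with preview features enabled:
//   javac --release 19 --enable-preview VirtualThreadSmokeTest.java
//   java --enable-preview VirtualThreadSmokeTest
public class VirtualThreadSmokeTest {
  public static void main(String[] args) throws InterruptedException {
    // Thread.ofVirtual() is part of JEP 425, a preview API in JDK 19.
    Thread t = Thread.ofVirtual()
        .start(() -> System.out.println("running on " + Thread.currentThread()));
    t.join();
  }
}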


## Heads-up - openjdk.java.net ➜ openjdk.org DNS transition

The OpenJDK infrastructure is moving from the old openjdk.java.net 
subdomain to the openjdk.org top-level domain. This will affect all 
active subdomains (i.e., bugs, cr, db, git, hg, mail, and wiki) and the 
old hostnames (*.openjdk.java.net) will now act as aliases for the new 
names. No actions are required as this transition should be transparent 
and is mostly done. It should be mentioned that https://jdk.java.net/ is 
not changing.


More information can be found in the original proposal:
https://mail.openjdk.java.net/pipermail/discuss/2022-May/006089.html



## JDK 19 Early-Access builds

JDK 19 Early-Access build 26 is now available [4], and is provided
under the GNU General Public License v2, with the Classpath Exception. 
The Release Notes are available here [5]. Given that JDK 19 is now in 
RDP1, the initial JDK 20 Early-Access builds are now also available [6].


[4] https://jdk.java.net/19/
[5] https://jdk.java.net/19/release-notes
[6] https://jdk.java.net/20/


### JEPs integrated into JDK 19:
- JEP 405: Record Patterns (Preview)
- JEP 422: Linux/RISC-V Port
- JEP 424: Foreign Function & Memory API (Preview)
- JEP 425: Virtual Threads (Preview)
- JEP 426: Vector API (Fourth Incubator)
- JEP 427: Pattern Matching for switch (Third Preview)
- JEP 428: Structured Concurrency (Incubator)

### Recent changes that may be of interest:

Build 26:
- JDK-8284199: Implementation of Structured Concurrency (Incubator)
- JDK-8282662: Use List.of() factory method to reduce memory consumption
- JDK-8284780: Need methods to create pre-sized HashSet and LinkedHashSet
- JDK-8250950: Allow per-user and system wide configuration of a 
jpackaged app
- JDK-8236569: -Xss not multiple of 4K does not work for the main thread 
on macOS

- JDK-4511638: Double.toString(double) sometimes produces incorrect results
- JDK-8287714: Improve handling of JAVA_ARGS
- JDK-8286850: [macos] Add support for signing user provided app image
- JDK-8287425: Remove unnecessary register push for MacroAssembler::check_k…
- JDK-8283694: Improve bit manipulation and boolean to integer conversion o…
- JDK-8287522: StringConcatFactory: Add in prependers and mixers in batches

Build 25:
- JDK-8284960: Integration of JEP 426: Vector API (Fourth Incubator)
- JDK-8287244: Add bound check in indexed memory access var handle
- JDK-8287292: Improve TransformKey to pack more kinds of transforms effici…
- JDK-8287003: InputStreamReader::read() can return zero despite writing a …
- JDK-8287064: Modernize ProxyGenerator.PrimitiveTypeInfo

Build 24:
- JDK-8286908: ECDSA signature should not return parameters
- JDK-8261768: SelfDestructTimer should accept seconds
- JDK-8286304: Removal of diagnostic flag GCParallelVerificationEnabled
- JDK-8267038: Update IANA Language Subtag Registry to Version 2022-03-02
- JDK-8285517: System.getenv() returns unexpected value if environment vari…
- JDK-8285513: JFR: Add more static support for event classes
- JDK-8287024: G1: Improve the API boundary between HeapRegionRemSet and G1…
- JDK-8287139: aarch64 intrinsic for unsignedMultiplyHigh

Build 23:
- JDK-8282191: Implementation of Foreign Function & Memory API (Preview)
- JDK-8286090: Add RC2/RC4 to jdk.security.legacyAlgorithms
- JDK-8282080: Lambda deserialization fails for Object method references 
on interfaces
- JDK-6782021: It is not possible to read local computer certificates 
with the SunMSCAPI provider

- JDK-8284194: Allow empty subject fields in keytool
- JDK-8209137: Add ability to bind to specific local address to HTTP client
- JDK-8286841: Add BigDecimal.TWO
- JDK-8287139: aarch64 intrinsic for unsignedMultiplyHigh
- JDK-8282160: JShell circularly-required classes cannot be defined
- JDK-8282280: Update Xerces to Version 2.12.2


## Topics of Interest

* Replacing Finalizers with