+1

This will really help with visibility.

On Tue, 14 Jun 2022, 00:56 Nhat Nguyen, <nhat.ngu...@elastic.co.invalid>
wrote:

> Hi Michael,
>
> We developed similar functionality in Elasticsearch. The DiskUsage API
> <https://github.com/elastic/elasticsearch/pull/74051> estimates the
> storage of each field by iterating its structures (i.e., inverted index,
> doc values, stored fields, etc.) and tracking the number of bytes read.
> It runs quickly and the estimates are accurate.
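>
> A minimal sketch of the byte-counting idea (illustrative only, not the
> actual Elasticsearch code; the field name and index path below are
> placeholders): wrap the Directory so every IndexInput counts the bytes
> it reads, then walk one field's structures and attribute the counter
> delta to that field.
>
> import java.io.IOException;
> import java.nio.file.Path;
> import java.util.concurrent.atomic.AtomicLong;
> import org.apache.lucene.index.*;
> import org.apache.lucene.search.DocIdSetIterator;
> import org.apache.lucene.store.*;
>
> // Directory wrapper that counts every byte read through it.
> final class CountingDirectory extends FilterDirectory {
>   final AtomicLong bytesRead = new AtomicLong();
>
>   CountingDirectory(Directory in) { super(in); }
>
>   @Override
>   public IndexInput openInput(String name, IOContext context) throws IOException {
>     return new CountingIndexInput(super.openInput(name, context), bytesRead);
>   }
> }
>
> // IndexInput wrapper that adds every read to the shared counter.
> final class CountingIndexInput extends IndexInput {
>   private final IndexInput in;
>   private final AtomicLong counter;
>
>   CountingIndexInput(IndexInput in, AtomicLong counter) {
>     super("counting(" + in + ")");
>     this.in = in;
>     this.counter = counter;
>   }
>
>   @Override public byte readByte() throws IOException { counter.incrementAndGet(); return in.readByte(); }
>   @Override public void readBytes(byte[] b, int off, int len) throws IOException { counter.addAndGet(len); in.readBytes(b, off, len); }
>   @Override public void close() throws IOException { in.close(); }
>   @Override public long getFilePointer() { return in.getFilePointer(); }
>   @Override public void seek(long pos) throws IOException { in.seek(pos); }
>   @Override public long length() { return in.length(); }
>   @Override public IndexInput slice(String desc, long off, long len) throws IOException {
>     return new CountingIndexInput(in.slice(desc, off, len), counter);
>   }
>   @Override public IndexInput clone() { return new CountingIndexInput(in.clone(), counter); }
> }
>
> To attribute bytes to one field, snapshot the counter before and after
> exhausting that field's postings, e.g.:
>
> static long postingsBytes(Path indexPath, String field) throws IOException {
>   try (CountingDirectory dir = new CountingDirectory(FSDirectory.open(indexPath));
>        DirectoryReader reader = DirectoryReader.open(dir)) {
>     long before = dir.bytesRead.get();
>     for (LeafReaderContext leaf : reader.leaves()) {
>       Terms terms = leaf.reader().terms(field);
>       if (terms == null) continue;
>       TermsEnum te = terms.iterator();
>       PostingsEnum pe = null;
>       while (te.next() != null) {
>         pe = te.postings(pe, PostingsEnum.FREQS);
>         while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { /* just consume */ }
>       }
>     }
>     return dir.bytesRead.get() - before;
>   }
> }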
>
> I am +1 to the proposal.
>
> Thanks,
> Nhat
>
> On Mon, Jun 13, 2022 at 1:22 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> At Amazon, we have a need to produce regular metrics on how much disk
>> storage is consumed by each field. We manage an index with data
>> contributed by many teams and business units, and we are often asked to
>> produce reports attributing index storage usage to these customers.
>> The best tool we have for this today is based on a custom Codec that
>> separates storage by field; to get the statistics we read an existing
>> index and write it out with addIndexes and a force-merge, using the
>> custom codec. This is time-consuming and inefficient, and it tends not
>> to get done.
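>>
>> Roughly, the rewrite step looks like the sketch below
>> (PerFieldSeparatingCodec is a stand-in name for the custom codec,
>> which isn't shown):
>>
>> import java.io.IOException;
>> import java.nio.file.Path;
>> import org.apache.lucene.index.*;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.FSDirectory;
>>
>> static void rewriteWithStatsCodec(Path srcPath, Path dstPath) throws IOException {
>>   IndexWriterConfig iwc = new IndexWriterConfig(); // analyzer doesn't matter; we only re-encode
>>   iwc.setCodec(new PerFieldSeparatingCodec());     // stand-in for the custom codec
>>   try (Directory src = FSDirectory.open(srcPath);
>>        Directory dst = FSDirectory.open(dstPath);
>>        DirectoryReader reader = DirectoryReader.open(src);
>>        IndexWriter writer = new IndexWriter(dst, iwc)) {
>>     CodecReader[] leaves = new CodecReader[reader.leaves().size()];
>>     int i = 0;
>>     for (LeafReaderContext ctx : reader.leaves()) {
>>       leaves[i++] = SlowCodecReaderWrapper.wrap(ctx.reader());
>>     }
>>     writer.addIndexes(leaves); // re-encodes the segments with the new codec
>>     writer.forceMerge(1);      // one segment, so per-field file sizes are easy to read off
>>   }
>> }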
>>
>> I wonder if it would make sense to add methods to *some* API that
>> would expose a per-field disk space metric. If we don't want to add to
>> IndexReader, which would imply lots of intermediate methods and API
>> additions, maybe we could have it computed by CheckIndex?
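>>
>> From the caller's side it might look something like the sketch below
>> (CheckIndex and checkIndex() exist today; the per-field breakdown in
>> the comment is the hypothetical part):
>>
>> import java.io.IOException;
>> import java.nio.file.Path;
>> import org.apache.lucene.index.CheckIndex;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.FSDirectory;
>>
>> static void reportFieldUsage(Path indexPath) throws IOException {
>>   try (Directory dir = FSDirectory.open(indexPath);
>>        CheckIndex checker = new CheckIndex(dir)) {
>>     checker.setInfoStream(System.out);
>>     CheckIndex.Status status = checker.checkIndex();
>>     // Hypothetical addition: a per-field byte breakdown alongside the
>>     // existing per-segment status, e.g. something like:
>>     //   status.segmentInfos.get(0).fieldDiskUsage  // Map<String, Long>
>>   }
>> }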
>>
>> (Implementation note: for the current formats, each field's data is
>> already segregated on disk, I think. I suppose that in theory we might
>> want some data structure shared across fields some day, but that seems
>> like an edge case we could handle in some exceptional way.)
>>
