Dear Community,

I am writing to share some thoughts on the existing Disk Usage API; I believe there is an opportunity to improve its functionality and performance through a reimplementation.

Currently, the best tool we have for this is based on a custom Codec that separates storage by field: to get the statistics, we read an existing index and write it out with addIndexes plus a force-merge, using the custom codec. This is time-consuming and inefficient, and as a result it tends not to get done.

What we could do instead is similar to the functionality in Elasticsearch. Its DiskUsage API <https://github.com/elastic/elasticsearch/pull/74051> estimates the storage of each field by iterating over its data structures (inverted index, doc values, stored fields, etc.) and tracking the number of bytes read. Since this enumerates the existing index directly, it would not require force-merging all the data through addIndexes, and at the same time it does not intrude on the codec APIs.
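To make the "track the number of bytes read" idea concrete, here is a minimal, self-contained sketch of the core mechanism: a byte-counting reader that records how many bytes an enumeration consumes. The class and method names (CountingInput, getBytesRead) are hypothetical illustrations, not actual Lucene or Elasticsearch APIs; in a real implementation one would wrap the directory/input layer and attribute the counts per field and per structure.

```java
// Hypothetical sketch of read-byte tracking for disk-usage estimation.
// In practice the wrapper would sit around the index input layer; here a
// plain byte[] stands in for an on-disk structure such as a postings list.
public class CountingInput {
    private final byte[] data;   // simulated on-disk data for one field
    private int pos = 0;
    private long bytesRead = 0;  // running count of bytes consumed

    public CountingInput(byte[] data) {
        this.data = data;
    }

    // Read a single byte, counting it toward the field's estimate.
    public byte readByte() {
        bytesRead++;
        return data[pos++];
    }

    // Read a block of bytes, counting the whole block.
    public void readBytes(byte[] dst, int off, int len) {
        System.arraycopy(data, pos, dst, off, len);
        pos += len;
        bytesRead += len;
    }

    // The per-field storage estimate after fully enumerating the structure.
    public long getBytesRead() {
        return bytesRead;
    }

    public static void main(String[] args) {
        byte[] simulatedPostings = new byte[1024];
        CountingInput in = new CountingInput(simulatedPostings);
        // "Iterate" the structure end to end, as a postings enumeration would.
        byte[] buf = new byte[256];
        for (int i = 0; i < 4; i++) {
            in.readBytes(buf, 0, buf.length);
        }
        System.out.println("estimated bytes for field: " + in.getBytesRead());
    }
}
```

The appeal of this approach is that the estimate falls out of a read-only pass over the index: no rewrite, no merge, and no changes to the codec interfaces.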
Thank you for your time and consideration. I would greatly appreciate any input, suggestions, or concerns you might have regarding this proposal, and I look forward to your response.

Best regards,