Dear Community,

I am writing to share some thoughts on the existing Disk Usage API; I believe there is an opportunity to improve its functionality and performance through a reimplementation.

Currently, the best tool we have for this is based on a custom Codec that separates storage by field: to get the statistics, we read an existing index and write it out with addIndexes plus a force-merge, using the custom codec. This is time-consuming and inefficient, and as a result it tends not to get done.

What we could do instead is similar to the functionality in Elasticsearch. Its DiskUsage API <https://github.com/elastic/elasticsearch/pull/74051> estimates the storage of each field by iterating over its data structures (inverted index, doc values, stored fields, etc.) and tracking the number of bytes read. Since this enumerates the existing index directly, it would not require force-merging all the data through addIndexes, and at the same time it does not intrude on the codec APIs.
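To make the "track the number of bytes read" idea concrete, here is a minimal, self-contained sketch of the core mechanism: a byte-counting reader that records how many bytes an enumeration consumes. The class and method names (CountingInput, getBytesRead) are hypothetical illustrations, not actual Lucene or Elasticsearch APIs; in a real implementation one would wrap the directory/input layer and attribute the counts per field and per structure.

```java
// Hypothetical sketch of read-byte tracking for disk-usage estimation.
// In practice the wrapper would sit around the index input layer; here a
// plain byte[] stands in for an on-disk structure such as a postings list.
public class CountingInput {
    private final byte[] data;   // simulated on-disk data for one field
    private int pos = 0;
    private long bytesRead = 0;  // running count of bytes consumed

    public CountingInput(byte[] data) {
        this.data = data;
    }

    // Read a single byte, counting it toward the field's estimate.
    public byte readByte() {
        bytesRead++;
        return data[pos++];
    }

    // Read a block of bytes, counting the whole block.
    public void readBytes(byte[] dst, int off, int len) {
        System.arraycopy(data, pos, dst, off, len);
        pos += len;
        bytesRead += len;
    }

    // The per-field storage estimate after fully enumerating the structure.
    public long getBytesRead() {
        return bytesRead;
    }

    public static void main(String[] args) {
        byte[] simulatedPostings = new byte[1024];
        CountingInput in = new CountingInput(simulatedPostings);
        // "Iterate" the structure end to end, as a postings enumeration would.
        byte[] buf = new byte[256];
        for (int i = 0; i < 4; i++) {
            in.readBytes(buf, 0, buf.length);
        }
        System.out.println("estimated bytes for field: " + in.getBytesRead());
    }
}
```

The appeal of this approach is that the estimate falls out of a read-only pass over the index: no rewrite, no merge, and no changes to the codec interfaces.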
Thank you for your time and consideration. I would greatly appreciate any input, suggestions, or concerns you might have regarding this proposal, and I look forward to your response.

Best regards,