+1. This will really help with visibility.
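
For anyone who wants to experiment before anything lands in Lucene, here is a rough, untested sketch of the read-bytes idea Nhat describes below, against a plain Lucene index. CountingDirectory and CountingIndexInput are made-up names, not Lucene or Elasticsearch API: the wrapper counts every byte read through any IndexInput, then we walk one field's postings and report the byte delta as a crude per-field estimate. A real version would also visit doc-values, stored fields, points, etc., the way the Elasticsearch PR does:

    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.*;

    public class FieldDiskUsageSketch {
      static final AtomicLong BYTES_READ = new AtomicLong();

      // Wraps every IndexInput opened through this Directory so reads are counted.
      static class CountingDirectory extends FilterDirectory {
        CountingDirectory(Directory in) { super(in); }
        @Override
        public IndexInput openInput(String name, IOContext ctx) throws IOException {
          return new CountingIndexInput(super.openInput(name, ctx));
        }
      }

      static class CountingIndexInput extends IndexInput {
        private final IndexInput in;
        CountingIndexInput(IndexInput in) { super("counting(" + in + ")"); this.in = in; }
        @Override public byte readByte() throws IOException { BYTES_READ.incrementAndGet(); return in.readByte(); }
        @Override public void readBytes(byte[] b, int off, int len) throws IOException { BYTES_READ.addAndGet(len); in.readBytes(b, off, len); }
        @Override public void close() throws IOException { in.close(); }
        @Override public long getFilePointer() { return in.getFilePointer(); }
        @Override public void seek(long pos) throws IOException { in.seek(pos); }
        @Override public long length() { return in.length(); }
        @Override public IndexInput slice(String desc, long off, long len) throws IOException {
          return new CountingIndexInput(in.slice(desc, off, len));
        }
        @Override public IndexInput clone() { return new CountingIndexInput(in.clone()); }
      }

      public static void main(String[] args) throws IOException {
        String field = args[1];
        try (Directory dir = new CountingDirectory(FSDirectory.open(Paths.get(args[0])));
             DirectoryReader reader = DirectoryReader.open(dir)) {
          long start = BYTES_READ.get(); // ignore bytes read while opening the reader
          for (LeafReaderContext leaf : reader.leaves()) {
            Terms terms = leaf.reader().terms(field);
            if (terms == null) continue;
            TermsEnum te = terms.iterator();
            PostingsEnum pe = null;
            while (te.next() != null) {
              pe = te.postings(pe, PostingsEnum.ALL);
              while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) { }
            }
          }
          System.out.println(field + ": ~" + (BYTES_READ.get() - start) + " bytes read");
        }
      }
    }

It only counts what is actually read, so anything the codec loads lazily or skips over will be missed; that is presumably why the Elasticsearch implementation iterates every structure exhaustively.
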
On Tue, 14 Jun 2022, 00:56 Nhat Nguyen, <nhat.ngu...@elastic.co.invalid> wrote:

> Hi Michael,
>
> We developed similar functionality in Elasticsearch. The DiskUsage API
> <https://github.com/elastic/elasticsearch/pull/74051> estimates the
> storage of each field by iterating its structures (i.e., inverted index,
> doc-values, stored fields, etc.) and tracking the number of bytes read.
> The result is pretty fast and accurate.
>
> I am +1 to the proposal.
>
> Thanks,
> Nhat
>
> On Mon, Jun 13, 2022 at 1:22 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
>> At Amazon, we need to produce regular metrics on how much disk
>> storage is consumed by each field. We manage an index with data
>> contributed by many teams and business units, and we are often asked
>> to produce reports attributing index storage usage to these customers.
>> The best tool we have for this today is based on a custom Codec that
>> separates storage by field; to get the statistics, we read an existing
>> index and write it out using addIndexes and force-merging, with the
>> custom codec. This is time-consuming and inefficient, and it tends not
>> to get done.
>>
>> I wonder if it would make sense to add methods to *some* API that
>> would expose a per-field disk space metric. If we don't want to add to
>> IndexReader, which would imply lots of intermediate methods and API
>> additions, maybe we could have CheckIndex compute it?
>>
>> (Implementation note: for the current formats, the information for
>> each field is always segregated by field, I think. In theory we might
>> want some shared data structure across fields some day, but that seems
>> like an edge case we could handle in some exceptional way.)
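
And for reference, the rewrite-based workaround Michael describes boils down to something like the sketch below. PerFieldSeparatingCodec is a made-up name standing in for the custom codec; everything else is stock Lucene. The key point is that addIndexes(Directory...) just copies the source segments, and the forced merge is what actually re-encodes everything under the codec configured on the writer, after which the per-field files in the new directory can be measured:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class RewriteUnderCustomCodec {
      public static void main(String[] args) throws Exception {
        try (Directory src = FSDirectory.open(Paths.get(args[0]));
             Directory dst = FSDirectory.open(Paths.get(args[1]))) {
          IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
          // Hypothetical codec that writes each field's data to its own files:
          cfg.setCodec(new PerFieldSeparatingCodec());
          try (IndexWriter writer = new IndexWriter(dst, cfg)) {
            writer.addIndexes(src); // copies the existing segments into dst
            writer.forceMerge(1);   // rewrites everything with the codec set above
          }
        }
      }
    }

Rewriting the whole index just to attribute sizes is exactly the inefficiency being complained about, which is why a read-side answer (CheckIndex or otherwise) seems attractive.
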