On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen <[email protected]> wrote:
>
> Hi Michael,
>
> We developed a similar functionality in Elasticsearch. The DiskUsage API
> estimates the storage of each field by iterating its structures (i.e.,
> inverted index, doc-values, stored fields, etc.) and tracking the number of
> read-bytes. The result is pretty fast and accurate.
>
> I am +1 to the proposal.
>

I like an approach such as this: enumerate the index, using something like
FilterDirectory to track the bytes read. It doesn't require you to
force-merge all the data through addIndexes, and at the same time it doesn't
invade the codec APIs. For situations such as benchmarks or tracking space
over time, the user can always force-merge the data themselves; otherwise
the fluctuations from merges could create too much noise.

Personally, I would suggest a separate API/tool from CheckIndex. Perhaps this
tracking could mask bugs? There is no reason to mix the two concerns. Also,
the tool can be much more efficient than CheckIndex: e.g. for stored fields
and vectors it can just retrieve the first and last documents, whereas
CheckIndex should verify all of the documents, which is slow.
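
To make the byte-tracking idea concrete, here is a rough sketch of the counting-wrapper pattern using only java.io. It is not Lucene code: CountingInputStream and CountingDemo are hypothetical names, and a real tool would presumably apply the same delegation idea through FilterDirectory and the inputs it opens, which this sketch does not attempt to show.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not a Lucene class): delegates every read to the
// wrapped stream and counts the bytes that were actually returned.
class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++; // -1 means end of stream, so nothing was read
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    // Skipped bytes are deliberately not counted; only real reads are.
    long bytesRead() {
        return bytesRead;
    }
}

class CountingDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for one index structure; the 10-byte payload is made up.
        CountingInputStream in =
            new CountingInputStream(new ByteArrayInputStream(new byte[10]));
        in.read();                  // 1 byte
        in.read(new byte[4], 0, 4); // 4 more bytes
        System.out.println("bytes read: " + in.bytesRead());
    }
}
```

A per-field estimate would then come from resetting or snapshotting the counter before and after iterating each field's structures.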
