On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
<[email protected]> wrote:
>
> Hi Michael,
>
> We developed a similar functionality in Elasticsearch. The DiskUsage API 
> estimates the storage of each field by iterating its structures (i.e., 
> inverted index, doc-values, stored fields, etc.) and tracking the number of 
> read-bytes. The result is pretty fast and accurate.
>
> I am +1 to the proposal.
>

I like an approach such as this: enumerate the index, using something
like FilterDirectory to track the bytes read. It doesn't require you to
force-merge all the data through addIndexes, and at the same time it
doesn't invade the codec APIs.
The user can always force-merge the data themselves for situations
such as benchmarks or tracking space over time, where the
fluctuations from merges could otherwise create too much noise.
Personally, I would suggest a separate API/tool from CheckIndex.
Perhaps this tracking could mask bugs? There is no reason to mix the
two concerns.
Also, the tool can be much more efficient than CheckIndex: e.g. for
stored fields and vectors it can just retrieve the first and last
documents, whereas CheckIndex has to verify all of the documents,
which is slow.
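
To make the byte-tracking idea concrete, here is a minimal sketch using
only the JDK. It is not Lucene's FilterDirectory and not the
Elasticsearch implementation; the class name ReadByteTracker and its
methods are hypothetical. It only shows the wrapping pattern: every read
goes through a filter that adds the byte count to a per-label total.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts bytes read through wrapped streams,
// keyed by a caller-chosen label (e.g. a field or file name).
public class ReadByteTracker {
    private final Map<String, Long> bytesRead = new TreeMap<>();

    // Wrap a stream so that every successful read is attributed to the label.
    public InputStream track(String label, InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public int read() throws IOException {
                int b = super.read();
                if (b >= 0) {
                    add(label, 1);
                }
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                int n = super.read(buf, off, len);
                if (n > 0) {
                    add(label, n);
                }
                return n;
            }
        };
    }

    private void add(String label, long n) {
        bytesRead.merge(label, n, Long::sum);
    }

    // Copy of the totals gathered so far.
    public Map<String, Long> snapshot() {
        return new TreeMap<>(bytesRead);
    }
}
```

A caller would wrap each input with track("some-label", stream), read
through the structures of interest, and then inspect snapshot(). Skipped
bytes are deliberately not counted here; whether a real tool should count
them is a design choice this sketch leaves open.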

