Oh yes, that's a clever idea. It seems it would still take quite a while (tens of minutes?) for a larger index, though it would be much faster than the force-merge solution for sure. I guess to get any faster we would have to instrument each format. They generally do know how much space each field occupies, but perhaps exposing that is too much of an API change.
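To make the byte-tracking idea from the quoted thread below concrete, here is a rough, stdlib-only Java sketch. It is not Lucene code and not how the DiskUsage API is implemented. It only illustrates the principle of reading each field's structures through a counting wrapper and attributing the bytes read to that field. The names `ByteTrackingSketch`, `CountingInputStream`, and `usageByField` are made up for illustration, and the per-field byte arrays stand in for whatever a real tool would read through a FilterDirectory-style wrapper.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class ByteTrackingSketch {

  /** Hypothetical wrapper: counts every byte that is read through it. */
  static final class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    CountingInputStream(InputStream in) {
      super(in);
    }

    @Override
    public int read() throws IOException {
      int b = super.read();
      if (b >= 0) {
        bytesRead++;
      }
      return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      int n = super.read(buf, off, len);
      if (n > 0) {
        bytesRead += n;
      }
      return n;
    }

    long bytesRead() {
      return bytesRead;
    }
  }

  /**
   * Assumption: each field's on-disk structures are modeled as one byte array.
   * Reads each one fully through the counting wrapper and records the total.
   */
  public static Map<String, Long> usageByField(Map<String, byte[]> fieldData) {
    Map<String, Long> usage = new LinkedHashMap<>();
    byte[] buf = new byte[4096];
    for (Map.Entry<String, byte[]> e : fieldData.entrySet()) {
      try (CountingInputStream in =
          new CountingInputStream(new ByteArrayInputStream(e.getValue()))) {
        while (in.read(buf, 0, buf.length) != -1) {
          // Drain the stream; the wrapper does the accounting.
        }
        usage.put(e.getKey(), in.bytesRead());
      } catch (IOException ex) {
        throw new UncheckedIOException(ex);
      }
    }
    return usage;
  }

  public static void main(String[] args) {
    Map<String, byte[]> fields = new LinkedHashMap<>();
    fields.put("title", new byte[120]);
    fields.put("body", new byte[5000]);
    System.out.println(usageByField(fields));
  }
}
```

The cost concern above shows up directly in this sketch: the accounting is only as cheap as the amount of data that has to be read, which is why sampling (for example, only the first and last documents for stored fields) matters for large indexes.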

On Tue, Jun 14, 2022 at 12:09 AM Nhat Nguyen <[email protected]> wrote:
>
>> Also, the tool can be much more efficient than checkindex, e.g. for
>> stored fields and vectors it can just retrieve the first and last
>> documents, whereas checkindex should verify all of the documents
>> slowly.
>
> Yes, we implemented a similar heuristic in the DiskUsage API in Elasticsearch.
>
> On Mon, Jun 13, 2022 at 11:27 PM Robert Muir <[email protected]> wrote:
>>
>> On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
>> <[email protected]> wrote:
>> >
>> > Hi Michael,
>> >
>> > We developed a similar functionality in Elasticsearch. The DiskUsage API
>> > estimates the storage of each field by iterating its structures (i.e.,
>> > inverted index, doc-values, stored fields, etc.) and tracking the number
>> > of read-bytes. The result is pretty fast and accurate.
>> >
>> > I am +1 to the proposal.
>> >
>>
>> I like an approach such as this, enumerate the index, using something
>> like FilterDirectory to track the bytes. It doesn't require you to
>> force-merge all the data through addIndexes, and at the same time it
>> doesn't invade the codec apis.
>> The user can always force-merge the data themselves for situations
>> such as benchmarks/tracking space over time, otherwise the
>> fluctuations from merges could create too much noise.
>> Personally, I would suggest separate api/tool from CheckIndex, perhaps
>> this tracking could mask bugs? No reason to mix the two concerns.
>> Also, the tool can be much more efficient than checkindex, e.g. for
>> stored fields and vectors it can just retrieve the first and last
>> documents, whereas checkindex should verify all of the documents
>> slowly.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
