Hi,

In <d6e52b13-a822-4c2f-8e9e-6023c2dd8...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 22:11:54 +0200,
  Antoine Pitrou <anto...@python.org> wrote:

>>>> Fields:
>>>> | Name           | Type                  | Comments |
>>>> |----------------|-----------------------| -------- |
>>>> | column         | utf8                  | (2)      |
>>>> | key            | utf8 not null         | (3)      |
>>>
>>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>>> the representation more efficient where there are many columns?
>> Dictionary is more efficient. But we need to standardize not
>> only key but also ID -> key mapping.
>
> I don't get why we would need to standardize ID -> key mapping. The
> key names would be significant, the dictionary mapping is just for
> efficiency.

Ah, only space efficiency was being discussed here, right? I
thought that computational efficiency was also being discussed.
If we standardize the ID -> key mapping, consumers don't need to
compare key names.

Example: We want to find the "distinct_count" statistics.

If we standardize the ID -> key mapping (1 -> "distinct_count"),
consumers can find the "distinct_count" statistics by looking for
entries with ID 1.

If we don't standardize the ID -> key mapping, consumers need to
compare key names to find the "distinct_count" statistics.

Anyway, this string comparison will not be a large overhead
because (1) statistics data will not be large and (2) consumers
can cache the ID -> key mapping to avoid repeated string
comparisons. So standardizing the ID -> key mapping isn't
required.


Thanks,
-- 
kou
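
P.S. Here is a minimal sketch of point (2), caching the ID -> key
mapping on the consumer side. It uses plain Python lists as a
stand-in for a `dictionary(int32, utf8)` key column, not real Arrow
arrays. The function names, the "null_count" and "max_value" keys,
and all of the values are made up for illustration; only
"distinct_count" comes from the example above.

```python
# Hypothetical stand-in for a dictionary(int32, utf8) "key" column:
# `dictionary` holds each distinct key name once, `indices` holds one ID per row.
dictionary = ["null_count", "distinct_count", "max_value"]
indices = [0, 1, 2, 1]    # row i has key dictionary[indices[i]]
values = [3, 42, 100, 7]  # made-up statistic values, one per row


def resolve_key_id(dictionary, key_name):
    """Compare strings once per dictionary entry, not once per row."""
    for key_id, name in enumerate(dictionary):
        if name == key_name:
            return key_id
    return None  # the producer did not emit this statistic


def find_statistics(dictionary, indices, values, key_name):
    """Return the values of every row whose key is `key_name`."""
    key_id = resolve_key_id(dictionary, key_name)  # resolved once, reused below
    if key_id is None:
        return []
    # From here on only integer comparisons are needed.
    return [value for idx, value in zip(indices, values) if idx == key_id]


distinct_counts = find_statistics(dictionary, indices, values, "distinct_count")
print(distinct_counts)
```

The string comparisons happen only while scanning the dictionary,
so their cost depends on the number of distinct keys and not on the
number of statistics rows. This works without any standardized IDs.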