Re: [DISCUSS] Statistics through the C data interface

Sutou Kouhei Sun, 09 Jun 2024 14:36:00 -0700

Hi,

In <d6e52b13-a822-4c2f-8e9e-6023c2dd8...@python.org>
  "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun 2024 22:11:54 +0200,
  Antoine Pitrou <anto...@python.org> wrote:


>>>> Fields:
>>>> | Name           | Type                  | Comments |
>>>> |----------------|-----------------------| -------- |
>>>> | column         | utf8                  | (2)      |
>>>> | key            | utf8 not null         | (3)      |
>>>
>>> 1. Should the key be something like `dictionary(int32, utf8)` to make
>>> the representation more efficient where there are many columns?
>> Dictionary is more efficient. But we need to standardize not
>> only key but also ID -> key mapping.
> 
> I don't get why we would need to standardize ID -> key mapping. The
> key names would be significant, the dictionary mapping is just for
> efficiency.

Ah, only space efficiency was being discussed here, right? I
thought computational efficiency was also being discussed.
If we standardize the ID -> key mapping, consumers don't
need to compare key names.

Example: We want to find the "distinct_count" statistics.

If we standardize the ID -> key mapping (1 -> "distinct_count"),
consumers can find the "distinct_count" statistics by looking
for entries with ID 1.

If we don't standardize the ID -> key mapping, consumers need to
compare key names to find the "distinct_count" statistics.


Anyway, this string comparison will not be a large
overhead because (1) statistics data will not be large and
(2) consumers can cache the ID -> key mapping to avoid
repeated string comparisons. So standardizing the ID -> key
mapping isn't required.
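
A rough sketch of the caching idea, in plain Python rather than
any real Arrow API. The dictionary-encoded key column is modeled
as two lists, and find_rows() as well as the key names other than
"distinct_count" are made up for illustration only:

```python
# Hypothetical model of a dictionary-encoded "key" column:
# `dictionary` holds the distinct key names chosen by the producer,
# `indices` holds one key ID per statistics row.
def find_rows(dictionary, indices, key_name):
    """Return the row positions whose key equals key_name."""
    # Resolve the name to an ID once. These are the only string
    # comparisons, and the dictionary is small.
    try:
        target_id = dictionary.index(key_name)
    except ValueError:
        return []  # the producer did not emit this statistic
    # Per-row work is integer comparison only.
    return [row for row, key_id in enumerate(indices) if key_id == target_id]


# The producer may assign IDs in any order (not standardized).
dictionary = ["row_count", "distinct_count", "null_count"]  # illustrative names
indices = [0, 1, 2, 1, 2]              # key IDs for five statistics rows
columns = ["a", "a", "a", "b", "b"]    # illustrative column names

rows = find_rows(dictionary, indices, "distinct_count")
print([(columns[r], dictionary[indices[r]]) for r in rows])
```

The consumer compares strings once per dictionary, not once per
row, so the result is the same even if another producer assigns
different IDs.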


Thanks,
-- 
kou
