Re: [DISCUSS] Statistics through the C data interface

Adam Lippai Sun, 09 Jun 2024 15:29:44 -0700

It's not strictly statistics, but would this also cover constraints and
indexes? For example: table, record batch, and column primary keys,
unique keys, sort keys, bloom filters, HNSW indexes, and shape (ndarray
for keys xyz).


I'm not sure which backends (DB, Parquet, Lance) expose which of these
natively, but it might be worth considering for a minute.

Best regards,
Adam Lippai

On Sun, Jun 9, 2024 at 17:36 Sutou Kouhei <[email protected]> wrote:

> Hi,
>
> In <[email protected]>
>   "Re: [DISCUSS] Statistics through the C data interface"
>   on Sun, 9 Jun 2024 22:11:54 +0200,
>   Antoine Pitrou <[email protected]> wrote:
>
> >>>> Fields:
> >>>> | Name           | Type                  | Comments |
> >>>> |----------------|-----------------------| -------- |
> >>>> | column         | utf8                  | (2)      |
> >>>> | key            | utf8 not null         | (3)      |
> >>>
> >>> 1. Should the key be something like `dictionary(int32, utf8)` to make
> >>> the representation more efficient where there are many columns?
> >> Dictionary is more efficient. But we need to standardize not
> >> only key but also ID -> key mapping.
> >
> > I don't get why we would need to standardize ID -> key mapping. The
> > key names would be significant, the dictionary mapping is just for
> > efficiency.
>
> Ah, only space efficiency was being discussed here, right? I
> thought computational efficiency was also being discussed.
> If we standardize the ID -> key mapping, consumers don't
> need to compare key names.
>
> Example: We want to find "distinct_count" statistics.
>
> If we standardize the ID -> key mapping (1 -> "distinct_count"),
> consumers can find "distinct_count" statistics by looking up
> the entry with ID 1.
>
> If we don't standardize the ID -> key mapping, consumers need to
> compare key names to find "distinct_count" statistics.
>
>
> Anyway, this (string comparison) will not be a large
> overhead because (1) statistics data will not be large
> and (2) consumers can cache the ID -> key mapping to avoid
> repeated string comparisons. So standardizing the ID -> key
> mapping isn't required.
>
>
> Thanks,
> --
> kou
>
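
For illustration, here is a minimal sketch of the caching idea quoted
above, written in plain Python rather than against any Arrow API. The
lists `dictionary`, `indices`, `columns`, and `values`, and both helper
functions, are hypothetical stand-ins for a dictionary(int32, utf8) key
column; none of them come from the proposal. It assumes the ID -> key
mapping is not standardized, so the consumer resolves the key name once
per dictionary and then compares only integers per row.

```python
from typing import Optional

# Hypothetical stand-in for a dictionary(int32, utf8) "key" column:
# `dictionary` holds the distinct key names, `indices` holds one ID per row.
dictionary = ["null_count", "distinct_count", "max_value"]
indices = [0, 1, 2, 0, 1]

# Hypothetical companion columns: which column each row describes, and
# the statistic's value (simplified to plain integers here).
columns = ["a", "a", "a", "b", "b"]
values = [0, 42, 100, 3, 7]


def resolve_key_id(dictionary: list[str], key: str) -> Optional[int]:
    """Find the ID for `key`: string comparisons happen once per dictionary."""
    try:
        return dictionary.index(key)
    except ValueError:
        return None


def find_statistics(key: str) -> dict[str, int]:
    """Collect the value of statistic `key` for every column that has it."""
    key_id = resolve_key_id(dictionary, key)
    if key_id is None:
        return {}
    # Per-row work is an integer comparison, not a string comparison.
    return {
        col: val
        for col, idx, val in zip(columns, indices, values)
        if idx == key_id
    }


distinct_counts = find_statistics("distinct_count")
```

With a standardized mapping the `resolve_key_id` step would disappear,
but as the quoted message notes, the saving is small when the
statistics data itself is small.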
