Hello,

The recent dataset and compute work has forced us to think about
schema projection. One problem that surfaced is referencing fields in
nested schemas and/or schemas where duplicate column names exist. We
currently have (C++) APIs that pass either a vector<int> or a
vector<std::string> to represent a subset of fields; both approaches
pose challenges:

- Referencing a column by index can't access sub-fields of a nested type.
- Referencing a column by name can return more than one field.
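
To make the two limitations concrete, here is a minimal sketch on a toy
schema. The Field struct and the MakeSchema, SelectByIndex and
SelectByName helpers are hypothetical names invented for this
illustration; they are not Arrow's classes or existing APIs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a schema field (not Arrow's classes).
struct Field {
  std::string name;
  std::vector<Field> children;  // non-empty only for nested (struct-like) fields
};

// Toy schema: a, b: struct<x, y>, a (note the duplicate top-level name "a").
inline std::vector<Field> MakeSchema() {
  return {
      {"a", {}},
      {"b", {{"x", {}}, {"y", {}}}},
      {"a", {}},
  };
}

// Index-based selection: a single int only addresses top-level fields,
// so there is no index that means "b.x".
inline const Field& SelectByIndex(const std::vector<Field>& schema, int i) {
  return schema.at(static_cast<std::size_t>(i));
}

// Name-based selection: returns the index of every top-level field whose
// name matches, which may be zero, one or several.
inline std::vector<int> SelectByName(const std::vector<Field>& schema,
                                     const std::string& name) {
  std::vector<int> matches;
  for (std::size_t i = 0; i < schema.size(); ++i) {
    if (schema[i].name == name) matches.push_back(static_cast<int>(i));
  }
  return matches;
}
```

With this toy schema, SelectByName(schema, "a") yields two indices,
while no single top-level index or name reaches the nested field x.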

Thus, Ben drafted a PR [1] to allow referencing fields in a (hopefully)
unambiguous way. It is divided into two concepts:

- FieldPath: A stack of indices pointing into nested structures. It
points to exactly one field, or none if ill-formed. If the depth is
one, it is equivalent to referencing a field by index.
- FieldRef: A friendlier version that supports referencing by names
and/or a tiny string DSL similar to JSONPath. One can "dereference" a
FieldRef into a FieldPath given a schema. Since it supports name
components, a FieldRef can expand to more than one FieldPath.
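
The relationship between the two concepts can be sketched roughly as
follows. This is only an illustration under simplifying assumptions,
not the code from the PR: Field, Resolve and FindByNames are
hypothetical names, a FieldPath is modelled as a plain vector<int>, and
the name-based reference is modelled as a list of names matched level
by level (the string DSL is not covered).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified schema field; not the classes from the PR.
struct Field {
  std::string name;
  std::vector<Field> children;  // non-empty only for nested fields
};

// A path is a sequence of child indices, outermost first.
using FieldPath = std::vector<int>;

// Follow a path through the nested fields. Returns nullptr if the path is
// empty or any index is out of range, i.e. the path is ill-formed.
inline const Field* Resolve(const std::vector<Field>& fields,
                            const FieldPath& path) {
  const std::vector<Field>* level = &fields;
  const Field* found = nullptr;
  for (int index : path) {
    if (index < 0 || static_cast<std::size_t>(index) >= level->size()) {
      return nullptr;
    }
    found = &(*level)[static_cast<std::size_t>(index)];
    level = &found->children;
  }
  return found;
}

// Expand a list of names into every matching path. Because names may be
// duplicated at any level, the result can hold zero, one or several paths.
inline std::vector<FieldPath> FindByNames(const std::vector<Field>& fields,
                                          const std::vector<std::string>& names,
                                          std::size_t depth = 0) {
  std::vector<FieldPath> out;
  if (depth >= names.size()) return out;
  for (std::size_t i = 0; i < fields.size(); ++i) {
    if (fields[i].name != names[depth]) continue;
    if (depth + 1 == names.size()) {
      // Last name component: this field is a match.
      out.push_back({static_cast<int>(i)});
      continue;
    }
    // Otherwise descend and prepend the current index to each sub-path.
    for (FieldPath& tail : FindByNames(fields[i].children, names, depth + 1)) {
      tail.insert(tail.begin(), static_cast<int>(i));
      out.push_back(tail);
    }
  }
  return out;
}
```

In this sketch a path of depth one behaves like a plain column index,
and a name that occurs twice at the top level expands to two paths,
leaving it to the caller to decide whether that ambiguity is an error.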

We'd like to standardise most C++ APIs that take a vector of indices
(or names) to indicate a subset of columns on this new facility. For
this reason, we'd like feedback on the implementation. I encourage
developers of the other language implementations to look at this, as
they'll likely face the same issues.

Thank you,
François

[1] https://github.com/apache/arrow/pull/6545
