Hello,

The recent dataset and compute work has forced us to think about schema projection. One problem that surfaced is referencing fields in nested schemas and/or schemas where duplicate column names exist. We currently have (C++) APIs that take either a vector<int> or a vector<std::string> to represent a subset of fields; both ways pose challenges:
- Referencing a column by index can't access sub-fields of a nested type.
- Referencing a column by name can return more than one field.

Thus, Ben drafted a PR [1] to allow referencing fields in a (hopefully) non-ambiguous way. This is divided into 2 concepts:

- FieldPath: A stack of indices pointing into nested structures. It points to exactly one field, or none if ill-formed. If the depth is one, it is equivalent to referencing a field by index.
- FieldRef: A friendlier version that supports referencing by names and/or a tiny string DSL similar to JSONPath. One can "dereference" a FieldRef into a FieldPath given a schema. Since it supports name components, a FieldRef can expand to more than one FieldPath.

We'd like to standardise most C++ APIs that take a vector of indices (or names) to indicate a subset of columns so that they use this new facility. For this reason, we'd like feedback on the implementation. I encourage other language developers to look at this as they'll likely face the same issues.

Thank you,
François

[1] https://github.com/apache/arrow/pull/6545
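
P.S. To make the two concepts a bit more concrete, here is a minimal, self-contained sketch. It is not the code from the PR [1]: `Field`, `Resolve`, `FindAll`, `ExampleSchema` and `Demo` are hypothetical names used only for illustration, and matching a name at any nesting depth is an assumption of this sketch rather than a statement about what FieldRef does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a schema field (not arrow::Field).
struct Field {
  std::string name;
  std::vector<Field> children;  // non-empty only for nested types
};

// A FieldPath is a stack of indices, one per nesting level.
using FieldPath = std::vector<int>;

// Follow a path through the schema. Returns nullptr if the path is empty
// or ill-formed (an index is out of range at some level).
const Field* Resolve(const std::vector<Field>& schema, const FieldPath& path) {
  const std::vector<Field>* level = &schema;
  const Field* found = nullptr;
  for (int index : path) {
    if (index < 0 || static_cast<std::size_t>(index) >= level->size()) {
      return nullptr;
    }
    found = &(*level)[index];
    level = &found->children;
  }
  return found;
}

// Recursive helper: append the path of every field called `name`.
void FindAllImpl(const std::vector<Field>& fields, const std::string& name,
                 FieldPath* prefix, std::vector<FieldPath>* out) {
  for (std::size_t i = 0; i < fields.size(); ++i) {
    prefix->push_back(static_cast<int>(i));
    if (fields[i].name == name) out->push_back(*prefix);
    FindAllImpl(fields[i].children, name, prefix, out);
    prefix->pop_back();
  }
}

// Name-based lookup (assumption: names match at any depth).
// One name can expand to several paths.
std::vector<FieldPath> FindAll(const std::vector<Field>& schema,
                               const std::string& name) {
  FieldPath prefix;
  std::vector<FieldPath> out;
  FindAllImpl(schema, name, &prefix, &out);
  return out;
}

// Schema with a duplicated top-level name and a nested field:
//   id, point{x, id}, id
std::vector<Field> ExampleSchema() {
  Field x{"x", {}};
  Field nested_id{"id", {}};
  Field point{"point", {x, nested_id}};
  return {Field{"id", {}}, point, Field{"id", {}}};
}

// Usage: an index path reaches a sub-field; a name is ambiguous.
void Demo() {
  const std::vector<Field> schema = ExampleSchema();
  assert(Resolve(schema, {1, 0})->name == "x");  // point.x
  assert(Resolve(schema, {1, 5}) == nullptr);    // ill-formed path
  assert(FindAll(schema, "id").size() == 3);     // [0], [1, 1], [2]
}
```

In this toy model, a depth-one path such as `{2}` behaves like a plain column index, while the name "id" expands to three distinct paths, which is the ambiguity described above.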