In Beam schemas we don't seem to have a well-defined policy around special
characters (like $.[]) in field names. There is no explicit validation,
but we do have some ad-hoc rules (e.g. we use _ rather than the more
natural . when concatenating field names in a nested select [1]).
I think we should explicitly allow any special character (any valid UTF-8
character?) to be used in Beam schema field names. But in order to do this
we will need to provide solutions for some edge cases. To my knowledge
there are two problems that arise with some special characters in field
names:
1. They can't be mapped to language types (e.g. Java classes, and
NamedTuples in Python).
2. They can make field accesses ambiguous (e.g. does
`FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
with that exact name or a nested field?).
We already have some precedent for (1): Beam SQL produces field names like
`$col1` for unaliased fields in query outputs, and this is allowed. If a
user wants to map a schema with a field like this to a POJO, they have to
first rename the incompatible field(s), or use an @SchemaFieldName
annotation to map the field name. I think these are reasonable solutions.
We do not have a solution for (2), though. I think we should allow the use
of a backslash to escape characters that otherwise have special meaning for
FieldAccessDescriptors (based on [2], these are .[]{}*).
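To make the escaping idea concrete, here is a rough sketch of how a
backslash could disambiguate the "parent.child" case. This is not Beam
code: the class EscapedFieldPath and its split/escape methods are
hypothetical names used only for illustration, it only handles the dot
separator (not [], {} or *), and a real change would presumably be made in
the grammar in [2] instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam. Illustrates backslash escaping only.
final class EscapedFieldPath {
    // Characters assumed to need escaping: the special characters listed
    // above, plus the backslash itself.
    private static final String SPECIAL = ".[]{}*\\";

    // Splits a path on unescaped dots and drops the escaping backslashes.
    static List<String> split(String path) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '\\') {
                if (i + 1 >= path.length()) {
                    throw new IllegalArgumentException(
                        "Dangling escape character in: " + path);
                }
                // Take the next character literally, whatever it is.
                current.append(path.charAt(++i));
            } else if (c == '.') {
                // An unescaped dot separates a parent from a nested field.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    // Escapes a literal field name so it can be embedded in a path.
    static String escape(String fieldName) {
        StringBuilder sb = new StringBuilder();
        for (char c : fieldName.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With this sketch, split("parent.child") yields two components (a nested
field access), while split("parent\\.child") (a single backslash in the
actual string) yields the single field name "parent.child".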
Does anyone have any objection to this proposal, or is there anything I'm
overlooking? If not, I'm happy to take the task to implement the escape
character change.
Brian
[1] https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4