On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]> wrote:
>
> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <[email protected]> wrote:
>
>> In Beam schemas we don't seem to have a well-defined policy around
>> special characters (like $.[]) in field names. There's never any explicit
>> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
>> more natural . when concatenating field names in a nested select [1])
>>
>> I think we should explicitly allow any special character (any valid UTF-8
>> character?) to be used in Beam schema field names. But in order to do this
>> we will need to provide solutions for some edge cases. To my knowledge
>> there are two problems that arise with some special characters in field
>> names:
>>
>> 1. They can't be mapped to language types (e.g. Java Classes, and
>> NamedTuples in python).
>>
>
> We already have this problem - i.e. if you name a schema field to be int,
> or any other reserved string. We should disambiguate.
>

True, but as I point out below we have ways to deal with this problem. (2)
is really the problem we need to solve.

>
>> 2. It can make field accesses ambiguous (i.e. does
>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>> with that exact name or a nested field?).
>>
>
> I still think that we should reserve _some_ special characters. I'm not
> sure what the use is for allowing any character to be used.
>

The use would be ensuring that we don't run into compatibility issues when
mapping schemas from other systems that have made different choices about
which characters are special.

>
>> We already have some precedent for (1) - Beam SQL produces field names
>> like `$col1` for unaliased fields in query outputs, and this is allowed. If
>> a user wants to map a schema with a field like this to a POJO, they have to
>> first rename the incompatible field(s), or use an @SchemaFieldName
>> annotation to map the field name. I think these are reasonable solutions.
>>
>> We do not have a solution for (2) though. I think we should allow the use
>> of a backslash to escape characters that otherwise have special meaning for
>> FieldAccessDescriptors (based on [2] this is .[]{}*).
>>
>

I think the SQL way of handling this is to require a field name to be
wrapped in some way when it contains special characters, e.g.
"`some.parent.field`.`some.child.field`". We could consider that as well.

>
>> Does anyone have any objection to this proposal, or is there anything I'm
>> overlooking? If not, I'm happy to take the task to implement the escape
>> character change.
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>
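
To make the two options for (2) concrete, here is a rough sketch of how a
field path could be split into per-level names under each convention. It is
illustration only: `FieldPathSplitter`, `splitEscaped` and `splitQuoted` are
hypothetical names, not existing Beam APIs, and the sketch only treats `.` as
special (it ignores `[]{}*`, which a real change to the grammar in [2] would
also need to cover).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Beam): splits a field path into one name per nesting level. */
final class FieldPathSplitter {
  private FieldPathSplitter() {}

  /** Backslash convention: "\." is a literal dot and "\\" is a literal backslash. */
  static List<String> splitEscaped(String path) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < path.length(); i++) {
      char c = path.charAt(i);
      if (c == '\\') {
        if (i + 1 == path.length()) {
          throw new IllegalArgumentException("Dangling escape at end of: " + path);
        }
        current.append(path.charAt(++i)); // take the next character literally
      } else if (c == '.') {
        parts.add(current.toString()); // an unescaped dot ends the current level
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    parts.add(current.toString());
    return parts;
  }

  /** SQL-like convention: a dot inside backticks is part of the field name. */
  static List<String> splitQuoted(String path) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < path.length(); i++) {
      char c = path.charAt(i);
      if (c == '`') {
        inQuotes = !inQuotes; // backticks delimit a name and are not kept
      } else if (c == '.' && !inQuotes) {
        parts.add(current.toString()); // a dot outside backticks ends the current level
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    if (inQuotes) {
      throw new IllegalArgumentException("Unterminated backtick in: " + path);
    }
    parts.add(current.toString());
    return parts;
  }
}
```

Under these assumptions, `parent.child` splits into two levels with either
method, `parent\.child` passed to `splitEscaped` stays a single field named
`parent.child`, and "`some.parent.field`.`some.child.field`" passed to
`splitQuoted` yields the two names `some.parent.field` and `some.child.field`.
One difference worth noting: with backslash escaping, Java callers have to
double the backslash in string literals, while the backtick form reads the
same in source code and in documentation.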
