Semantically, I favor allowing field names to contain any Unicode character. I do not think encoding is a semantic property of a field name (or even of a string in a particular programming language), so UTF-8 doesn't need to be part of it. How a field name is input in a particular context is separable from which characters can occur in the name, and the encoding of a string when it is turned into bytes is orthogonal to which characters are in the string.
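To make that separation concrete, here is a minimal Python sketch. It assumes one possible input convention (dot as path separator, backticks as quotes, a doubled backtick for a literal backtick inside quotes). The helpers `parse_path`, `unparse_name`, and `unparse_path` are hypothetical names for illustration, not Beam APIs. The field names themselves are plain strings; quoting exists only where a path is typed in or printed out.

```python
from typing import List

SEPARATOR = "."  # assumed path separator
QUOTE = "`"      # assumed quoting character


def parse_path(text: str) -> List[str]:
    """Turn input text into a list of field names.

    Quoting is consumed here; the returned names contain no quote syntax.
    """
    names: List[str] = []
    current: List[str] = []
    quoted = False
    i = 0
    while i < len(text):
        ch = text[i]
        if quoted:
            if ch == QUOTE:
                if text[i + 1:i + 2] == QUOTE:
                    # Doubled backtick inside quotes is a literal backtick.
                    current.append(QUOTE)
                    i += 1
                else:
                    quoted = False
            else:
                current.append(ch)
        elif ch == QUOTE:
            quoted = True
        elif ch == SEPARATOR:
            # An unquoted dot ends the current name.
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if quoted:
        raise ValueError("unterminated backtick quote")
    names.append("".join(current))
    return names


def unparse_name(name: str) -> str:
    """Print one field name, quoting it whenever it is not a plain word."""
    plain = name != "" and all(c.isalnum() or c == "_" for c in name)
    if plain:
        return name
    return QUOTE + name.replace(QUOTE, QUOTE * 2) + QUOTE


def unparse_path(names: List[str]) -> str:
    """Print a path so that parsing it again yields the same names."""
    return SEPARATOR.join(unparse_name(n) for n in names)


# parse_path("parent.child")            -> ['parent', 'child']
# parse_path("`parent.child`")          -> ['parent.child']
# unparse_path(["parent.child", "x"])   -> '`parent.child`.x'
```

Printing through `unparse_path` rather than joining raw strings is what keeps a dot inside a name distinguishable from a dot between names.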
SQL has a good convention to allow any character (backticks, as you demonstrated), as do most Unix shells / filesystems. Note again that backtick and backslash conventions are how to _input_ a field name, not the characters actually in the field name.

Your example of "parent.child" is a good one, too: the dot is not part of any field name, but just a way to input a list of names to construct a path. And your later example of using backticks around the dot works perfectly if you want a dot in the field name.

This is a solved problem IMO, and we just have to take a solution off the shelf. Since schemas are pretty closely related to SQL, how about just using these particular SQL conventions? I like backticks and I also like backslashes.

For debuggability, we need to always print a properly unparsed identifier, not just print the field name as a string. So in the example of "we use _ rather than the more natural . when concatenating field names in a nested select" I would prefer to just use a dot, for clarity, and when printing it the position of the backticks will make it totally clear that the dot is not a field separator.

Kenn

On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <[email protected]> wrote:

> Given the flexibility of SQL, and the diversity of upstream systems, I'd lean on the side of being maximally flexible and saying a field name is a utf-8 string (including whitespace?), but special characters may require quoting and/or not allow some convenience (e.g. POJO creation).
>
> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <[email protected]> wrote:
> >
> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted) field names to contain any character. So it's currently possible for SqlTransform to produce schemas with field names containing dots and other special characters, which we can't handle properly outside of the SQL context.
> > If we do want to have some special characters, I think we should validate that schemas don't contain them, which would limit what you can output with SqlTransform, for better or worse.
> >
> > > We impose limits on Beam field names, and have automatic ways of escaping or translating characters that don't match. When the Beam field name does not match the field name in other systems, we use field Options to store the "original" name so it is not lost. That way we don't have to rely on the field names always being textually identical.
> >
> > A good use of the new Options feature :)
> > One of the problems I would like this thread to solve though is the possibility of using schemas and rows for the Options themselves (discussed extensively in Alex's PR [3]). If we use Options to handle special characters, we would need options on the schema of the Options (I think I said that right?) to solve it in that context.
> >
> > > I'm all for initial strict naming rules, that we can relax as we learn more. Additional restrictions tend to require major version changes to accommodate the backwards incompatibility.
> >
> > I think it may be too late to be strict though, since schemas came from SQL, and both supported SQL dialects are very permissive here. At this point it seems easier to be very permissive within Beam, and provide ways to deal with incompatibilities at the boundaries (e.g. SDKs providing ways to translate fields for language types, raising errors when a schema is incompatible for some IO, etc).
> >
> > [1] https://calcite.apache.org/docs/reference.html#identifiers
> > [2] https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
> > [3] https://github.com/apache/beam/pull/10413
> >
> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <[email protected]> wrote:
> >>
> >> I'm all for initial strict naming rules, that we can relax as we learn more.
> >> Additional restrictions tend to require major version changes to accommodate the backwards incompatibility.
> >>
> >> I'd rather community provide compelling use cases for relaxations than us speculating what could be useful in the outset.
> >>
> >> That said, it might be a touch late for schema fields...
> >>
> >> It's definitely my Go Bias showing but a sensible start is to not allow fields to start with a digit. This matches most C derived languages (which includes all our SDK languages at present, except maybe for Scio...).
> >>
> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <[email protected]> wrote:
> >>>
> >>> For completeness, here's another proposal.
> >>>
> >>> We impose limits on Beam field names, and have automatic ways of escaping or translating characters that don't match. When the Beam field name does not match the field name in other systems, we use field Options to store the "original" name so it is not lost. That way we don't have to rely on the field names always being textually identical.
> >>>
> >>> Downside here: any time we automatically munge a field name, we make select statements a bit more awkward, as the user has to put the munged field name into the select.
> >>>
> >>> Reuven
> >>>
> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <[email protected]> wrote:
> >>>>
> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]> wrote:
> >>>>>
> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <[email protected]> wrote:
> >>>>>>
> >>>>>> In Beam schemas we don't seem to have a well-defined policy around special characters (like $.[]) in field names. There's never any explicit validation, but we do have some ad-hoc rules (e.g. we use _ rather than the more natural . when concatenating field names in a nested select [1])
> >>>>>>
> >>>>>> I think we should explicitly allow any special character (any valid UTF-8 character?)
> >>>>>> to be used in Beam schema field names. But in order to do this we will need to provide solutions for some edge cases. To my knowledge there are two problems that arise with some special characters in field names:
> >>>>>>
> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and NamedTuples in python).
> >>>>>
> >>>>> We already have this problem - i.e. if you name a schema field to be int, or any other reserved string. We should disambiguate.
> >>>>
> >>>> True, but as I point out below we have ways to deal with this problem. (2) is really the problem we need to solve.
> >>>>>>
> >>>>>> 2. It can make field accesses ambiguous (i.e. does `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field with that exact name or a nested field?).
> >>>>>
> >>>>> I still think that we should reserve _some_ special characters. I'm not sure what the use is for allowing any character to be used.
> >>>>
> >>>> The use would be ensuring that we don't run into compatibility issues when mapping schemas from other systems that have made different choices about which characters are special.
> >>>>>>
> >>>>>> We already have some precedent for (1) - Beam SQL produces field names like `$col1` for unaliased fields in query outputs, and this is allowed. If a user wants to map a schema with a field like this to a POJO, they have to first rename the incompatible field(s), or use an @SchemaFieldName annotation to map the field name. I think these are reasonable solutions.
> >>>>>>
> >>>>>> We do not have a solution for (2) though. I think we should allow the use of a backslash to escape characters that otherwise have special meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
> >>>>
> >>>> I think the SQL way of handling this is to require a field name to be wrapped in some way when it contains special characters, e.g. "`some.parent.field`.`some.child.field`".
> >>>> We could consider that as well.
> >>>>>>
> >>>>>> Does anyone have any objection to this proposal, or is there anything I'm overlooking? If not, I'm happy to take the task to implement the escape character change.
> >>>>>>
> >>>>>> Brian
> >>>>>>
> >>>>>> [1] https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
> >>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
