Given the flexibility of SQL and the diversity of upstream systems, I'd err on the side of being maximally flexible and say that a field name is any UTF-8 string (including whitespace?), though special characters may require quoting and/or rule out some conveniences (e.g. POJO creation).
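As a rough illustration of the "lose some convenience" point, here is a minimal Python sketch using only the standard library (not Beam's API). The field names and the helper `maps_to_attribute` are made up for the example. It shows which names could be used directly as attributes of a language type, and how `collections.namedtuple` falls back to positional names when asked to rename the rest.

```python
import keyword
from collections import namedtuple


def maps_to_attribute(field_name: str) -> bool:
    """Return True if field_name can be used directly as a Python attribute."""
    # Hypothetical helper: a valid identifier that is not a reserved keyword.
    return field_name.isidentifier() and not keyword.iskeyword(field_name)


# Illustrative field names, including the kinds of names SQL can produce.
field_names = ["id", "$col1", "parent.child", "class", "first name"]

# Names that need no translation to become attributes.
direct = [name for name in field_names if maps_to_attribute(name)]

# With rename=True, namedtuple replaces each invalid name with a positional
# name built from its index (_1, _2, ...), so the original name is lost
# unless it is recorded somewhere else.
Row = namedtuple("Row", field_names, rename=True)
```

Only `id` survives unchanged here. The other names would need either a rename or some mapping back to the original, which is roughly the role the Options and `@SchemaFieldName` ideas play in this thread.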
On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <bhule...@google.com> wrote:
>
> Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted)
> field names to contain any character. So it's currently possible for
> SqlTransform to produce schemas with field names containing dots and other
> special characters, which we can't handle properly outside of the SQL
> context. If we do want to have some special characters, I think we should
> validate that schemas don't contain them, which would limit what you can
> output with SqlTransform, for better or worse.
>
> > We impose limits on Beam field names, and have automatic ways of escaping
> > or translating characters that don't match. When the Beam field name does
> > not match the field name in other systems, we use field Options to store
> > the "original" name so it is not lost. That way we don't have to rely on
> > the field names always being textually identical.
>
> A good use of the new Options feature :)
> One of the problems I would like this thread to solve though is the
> possibility of using schemas and rows for the Options themselves (discussed
> extensively in Alex's PR [3]). If we use Options to handle special
> characters, we would need options on the schema of the Options (I think I
> said that right?) to solve it in that context.
>
> > I'm all for initial strict naming rules, that we can relax as we learn
> > more. Additional restrictions tend to require major version changes to
> > accommodate the backwards incompatibility.
>
> I think it may be too late to be strict though, since schemas came from SQL,
> and both supported SQL dialects are very permissive here. At this point it
> seems easier to be very permissive within Beam, and provide ways to deal with
> incompatibilities at the boundaries (e.g. SDKs providing ways to translate
> fields for language types, raising errors when a schema is incompatible for
> some IO, etc).
>
> [1] https://calcite.apache.org/docs/reference.html#identifiers
> [2] https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
> [3] https://github.com/apache/beam/pull/10413
>
> On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <rob...@frantil.com> wrote:
>>
>> I'm all for initial strict naming rules, that we can relax as we learn more.
>> Additional restrictions tend to require major version changes to accommodate
>> the backwards incompatibility.
>>
>> I'd rather community provide compelling use cases for relaxations than us
>> speculating what could be useful in the outset.
>>
>> That said, it might be a touch late for schema fields...
>>
>> It's definitely my Go Bias showing but a sensible start is to not allow
>> fields to start with a digit. This matches most C derived languages (which
>> includes all our SDK languages at present, except maybe for Scio...).
>>
>> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <re...@google.com> wrote:
>>>
>>> For completeness, here's another proposal.
>>>
>>> We impose limits on Beam field names, and have automatic ways of escaping
>>> or translating characters that don't match. When the Beam field name does
>>> not match the field name in other systems, we use field Options to store
>>> the "original" name so it is not lost. That way we don't have to rely on
>>> the field names always being textually identical.
>>>
>>> Downside here: any time we automatically munge a field name, we make select
>>> statements a bit more awkward, as the user has to put the munged field name
>>> into the select.
>>>
>>> Reuven
>>>
>>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <bhule...@google.com> wrote:
>>>>
>>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <bhule...@google.com> wrote:
>>>>>>
>>>>>> In Beam schemas we don't seem to have a well-defined policy around
>>>>>> special characters (like $.[]) in field names. There's never any
>>>>>> explicit validation, but we do have some ad-hoc rules (e.g. we use _
>>>>>> rather than the more natural . when concatenating field names in a
>>>>>> nested select [1])
>>>>>>
>>>>>> I think we should explicitly allow any special character (any valid
>>>>>> UTF-8 character?) to be used in Beam schema field names. But in order to
>>>>>> do this we will need to provide solutions for some edge cases. To my
>>>>>> knowledge there are two problems that arise with some special characters
>>>>>> in field names:
>>>>>>
>>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and
>>>>>> NamedTuples in python).
>>>>>
>>>>> We already have this problem - i.e. if you name a schema field to be int,
>>>>> or any other reserved string. We should disambiguate.
>>>>
>>>> True, but as I point out below we have ways to deal with this problem. (2)
>>>> is really the problem we need to solve.
>>>>>
>>>>>>
>>>>>> 2. It can make field accesses ambiguous (i.e. does
>>>>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>>>>> with that exact name or a nested field?).
>>>>>
>>>>> I still think that we should reserve _some_ special characters. I'm not
>>>>> sure what the use is for allowing any character to be used.
>>>>
>>>> The use would be ensuring that we don't run into compatibility issues when
>>>> mapping schemas from other systems that have made different choices about
>>>> which characters are special.
>>>>>
>>>>>>
>>>>>> We already have some precedent for (1) - Beam SQL produces field names
>>>>>> like `$col1` for unaliased fields in query outputs, and this is allowed.
>>>>>> If a user wants to map a schema with a field like this to a POJO, they
>>>>>> have to first rename the incompatible field(s), or use an
>>>>>> @SchemaFieldName annotation to map the field name. I think these are
>>>>>> reasonable solutions.
>>>>>>
>>>>>> We do not have a solution for (2) though. I think we should allow the
>>>>>> use of a backslash to escape characters that otherwise have special
>>>>>> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>>
>>>> I think the SQL way of handling this is to require a field name to be
>>>> wrapped in some way when it contains special characters, e.g.
>>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>>>>>
>>>>>> Does anyone have any objection to this proposal, or is there anything
>>>>>> I'm overlooking? If not, I'm happy to take the task to implement the
>>>>>> escape character change.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>>>>> [2]
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
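
To make the ambiguity in (2) and the backslash proposal concrete, here is a minimal Python sketch of how an escaped field path could be split into nested field names. This is not Beam's parser: `split_field_path` is a hypothetical helper, and it treats only the dot as special, whereas the real FieldSpecifierNotation grammar also gives meaning to `[]{}*`.

```python
from typing import List


def split_field_path(path: str, sep: str = ".", escape: str = "\\") -> List[str]:
    """Split a field path on unescaped separators.

    A character preceded by the escape character is taken literally, so
    "parent\\.child" (backslash, dot) names one field, not a nested field.
    """
    parts: List[str] = []
    current: List[str] = []
    chars = iter(path)
    for ch in chars:
        if ch == escape:
            # Consume the next character and keep it verbatim.
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling escape character at end of path")
            current.append(nxt)
        elif ch == sep:
            # An unescaped separator ends the current field name.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


nested = split_field_path("parent.child")      # two nested names
literal = split_field_path("parent\\.child")   # one name containing a dot
```

Under this assumption, `parent.child` always means a nested access, and a field whose name really contains a dot has to be written with the escape. The backtick-wrapping alternative mentioned above would resolve the same ambiguity by quoting whole names instead of escaping single characters.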