Re: Special characters in Beam Schema field names

Robert Burke Thu, 19 Mar 2020 17:43:41 -0700

Ah well. This shouldn't present a problem for implementation in Go at
least, with the intent of using struct field tags. By the spec,
https://golang.org/ref/spec#Struct_types, tags are string literals
https://golang.org/ref/spec#String_literals and, by convention, are
space-separated key:"value" pairs, so we can specify to our heart's content
within that if users want to hardcode complex SQL result columns explicitly.


(No formal design exists for Beam Schemas in Go just yet, though I'll
produce something in the coming months. Collaboration welcome, of course!)

On Thu, Mar 19, 2020, 5:06 PM Brian Hulette <[email protected]> wrote:

> I'm +1 on using the SQL (quoting) convention to handle special characters
> when inputting a field name, rather than an escape character.
>
> On Thu, Mar 19, 2020 at 2:24 PM Reuven Lax <[email protected]> wrote:
>
>> This sounds fine. We'd have to make our parser for Select clauses be a
>> bit smarter, but it shouldn't be too difficult to extend the grammar to
>> handle escape characters.
>>
>> On Wed, Mar 18, 2020 at 8:01 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> I favor allowing field names to contain any unicode character,
>>> semantically. I do not think encoding is a semantic property of a field
>>> name (or even a string in a particular programming language) so UTF-8
>>> doesn't need to be part of it. Inputting a field name in a particular
>>> context is separable from what characters can occur in the name, and the
>>> encoding of a string when it is turned into bytes is orthogonal to what
>>> characters are in the string.
>>>
>>> SQL has a good convention to allow any character (backticks, as you
>>> demonstrated), as do most unix shells / filesystems. Note again that
>>> backtick and backslash conventions are how to _input_ a field name, not the
>>> characters actually in the field name. Your example of "parent.child" is a
>>> good one, too: the dot is not part of any field name, but just a way to
>>> input a list of names to construct a path. And your later example of using
>>> backticks around the dot works perfectly if you want a dot in the field
>>> name. This is a solved problem IMO, and we just have to take a solution off
>>> the shelf.
>>>
>>> Since schemas are pretty closely related with SQL, how about just using
>>> these particular SQL conventions? I like backticks and I also like
>>> backslashes.
>>>
>>> For debuggability, we need to always print a properly unparsed
>>> identifier, not just print the field name as a string. So in the example of
>>> "we use _ rather than the more natural . when concatenating field names in
>>> a nested select" I would prefer to just use a dot, for clarity, and when
>>> printing it the position of the backticks will make it totally clear that
>>> the dot is not a field separator.
>>>
>>> Kenn
>>>
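
A minimal Python sketch of the backtick convention described above, covering
both directions: parsing a typed path into field names, and unparsing field
names into a string that parses back to the same list. The names parse_path
and unparse_path are hypothetical and are not Beam APIs. Doubling a backtick
to get a literal backtick inside a quoted name is an assumption borrowed from
common SQL practice, and only the dot separator is handled here, not the rest
of the FieldAccessDescriptor syntax.

```python
# Characters the thread lists as special for field access (.[]{}*),
# plus the backtick itself so that quoting stays unambiguous.
SPECIAL = set(".[]{}*`")


def parse_path(spec):
    """Split a dotted path into field names, honoring backtick quoting."""
    names, current, quoted, i = [], [], False, 0
    while i < len(spec):
        ch = spec[i]
        if ch == "`":
            if quoted and spec[i + 1:i + 2] == "`":
                # Assumption: a doubled backtick inside quotes is a literal one.
                current.append("`")
                i += 1
            else:
                quoted = not quoted
        elif ch == "." and not quoted:
            # An unquoted dot separates two field names.
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if quoted:
        raise ValueError("unterminated backtick in %r" % spec)
    names.append("".join(current))
    return names


def unparse_path(names):
    """Render field names so that parse_path() returns the same list."""
    def quote(name):
        if name == "" or any(c in SPECIAL for c in name):
            return "`" + name.replace("`", "``") + "`"
        return name
    return ".".join(quote(n) for n in names)
```

With this sketch, unparse_path(["parent.child"]) yields `parent.child` with
backticks, while unparse_path(["parent", "child"]) yields parent.child, so the
printed form shows whether the dot is a separator or part of the name.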
>>> On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> Given the flexibility of SQL, and the diversity of upstream systems,
>>>> I'd lean on the side of being maximally flexible and saying a field
>>>> name is a utf-8 string (including whitespace?), but special characters
>>>> may require quoting and/or not allow some convenience (e.g. POJO
>>>> creation).
>>>>
>>>> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <[email protected]>
>>>> wrote:
>>>> >
>>>> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow
>>>> (quoted) field names to contain any character. So it's currently possible
>>>> for SqlTransform to produce schemas with field names containing dots and
>>>> other special characters, which we can't handle properly outside of the SQL
context. If we do want to reserve some special characters, I think we should
>>>> validate that schemas don't contain them, which would limit what you can
>>>> output with SqlTransform, for better or worse.
>>>> >
>>>> > > We impose limits on Beam field names, and have automatic ways of
>>>> escaping or translating characters that don't match. When the Beam field
>>>> name does not match the field name in other systems, we use field Options
>>>> to store the "original" name so it is not lost. That way we don't have to
>>>> rely on the field names always being textually identical.
>>>> >
>>>> > A good use of the new Options feature :)
>>>> > One of the problems I would like this thread to solve though is the
>>>> possibility of using schemas and rows for the Options themselves (discussed
>>>> extensively in Alex's PR [3]). If we use Options to handle special
>>>> characters, we would need options on the schema of the Options (I think I
>>>> said that right?) to solve it in that context.
>>>> >
>>>> > > I'm all for initial strict naming rules, that we can relax as we
>>>> learn more. Additional restrictions tend to require major version changes
>>>> to accommodate the backwards incompatibility.
>>>> >
>>>> > I think it may be too late to be strict though, since schemas came
>>>> from SQL, and both supported SQL dialects are very permissive here. At this
>>>> point it seems easier to be very permissive within Beam, and provide ways
>>>> to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
>>>> to translate fields for language types, raising errors when a schema is
>>>> incompatible for some IO, etc).
>>>> >
>>>> > [1] https://calcite.apache.org/docs/reference.html#identifiers
>>>> > [2]
>>>> https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
>>>> > [3] https://github.com/apache/beam/pull/10413
>>>> >
>>>> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> I'm all for initial strict naming rules, that we can relax as we
>>>> learn more. Additional restrictions tend to require major version changes
>>>> to accommodate the backwards incompatibility.
>>>> >>
>>>> >> I'd rather the community provide compelling use cases for relaxations
>>>> than have us speculate about what could be useful at the outset.
>>>> >>
>>>> >> That said, it might be a touch late for schema fields...
>>>> >>
>>>> >> It's definitely my Go bias showing, but a sensible start is to not
>>>> allow fields to start with a digit. This matches most C-derived languages
>>>> (which includes all our SDK languages at present, except maybe for
>>>> Scio...).
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <[email protected]> wrote:
>>>> >>>
>>>> >>> For completeness, here's another proposal.
>>>> >>>
>>>> >>> We impose limits on Beam field names, and have automatic ways of
>>>> escaping or translating characters that don't match. When the Beam field
>>>> name does not match the field name in other systems, we use field Options
>>>> to store the "original" name so it is not lost. That way we don't have to
>>>> rely on the field names always being textually identical.
>>>> >>>
>>>> >>> Downside here: any time we automatically munge a field name, we
>>>> make select statements a bit more awkward, as the user has to put the
>>>> munged field name into the select.
>>>> >>>
>>>> >>> Reuven
>>>> >>>
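
The proposal above (restrict Beam field names, translate anything that does
not fit, and keep the original name in an option) could look roughly like the
following Python sketch. Everything here is hypothetical: the allowed
character set, the "original_name" option key, and the dict-based field
representation are assumptions for illustration, not Beam's schema API.

```python
import re


def munge(name):
    """Map an arbitrary name onto letters, digits and underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Assumption: names may not be empty or start with a digit.
    if cleaned == "" or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def munge_schema(field_names):
    """Return field dicts with munged names; originals are kept in options."""
    fields, used = [], set()
    for original in field_names:
        base = munge(original)
        candidate, n = base, 1
        # Two different originals can munge to the same name, so add a suffix.
        while candidate in used:
            n += 1
            candidate = "%s_%d" % (base, n)
        used.add(candidate)
        options = {"original_name": original} if candidate != original else {}
        fields.append({"name": candidate, "options": options})
    return fields
```

For example, munge_schema(["$col1", "parent.child"]) produces the names _col1
and parent_child, and a select would then have to reference those munged
names, which is the downside noted above.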
>>>> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <[email protected]>
>>>> wrote:
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]>
>>>> wrote:
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <
>>>> [email protected]> wrote:
>>>> >>>>>>
>>>> >>>>>> In Beam schemas we don't seem to have a well-defined policy
>>>> around special characters (like $.[]) in field names. There's never any
>>>> explicit validation, but we do have some ad-hoc rules (e.g. we use _ rather
>>>> than the more natural . when concatenating field names in a nested select
>>>> [1])
>>>> >>>>>>
>>>> >>>>>> I think we should explicitly allow any special character (any
>>>> valid UTF-8 character?) to be used in Beam schema field names. But in order
>>>> to do this we will need to provide solutions for some edge cases. To my
>>>> knowledge there are two problems that arise with some special characters in
>>>> field names:
>>>> >>>>>>
>>>> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes,
>>>> and NamedTuples in python).
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> We already have this problem, e.g. if you name a schema field
>>>> int, or any other reserved word. We should disambiguate.
>>>> >>>>
>>>> >>>> True, but as I point out below we have ways to deal with this
>>>> problem. (2) is really the problem we need to solve.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>> 2. It can make field accesses ambiguous (i.e. does
>>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>>> with that exact name or a nested field?).
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> I still think that we should reserve _some_ special characters.
>>>> I'm not sure what the use is for allowing any character to be used.
>>>> >>>>
>>>> >>>> The use would be ensuring that we don't run into compatibility
>>>> issues when mapping schemas from other systems that have made different
>>>> choices about which characters are special.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>> We already have some precedent for (1) - Beam SQL produces field
>>>> names like `$col1` for unaliased fields in query outputs, and this is
>>>> allowed. If a user wants to map a schema with a field like this to a POJO,
>>>> they have to first rename the incompatible field(s), or use an
>>>> @SchemaFieldName annotation to map the field name. I think these are
>>>> reasonable solutions.
>>>> >>>>>>
>>>> >>>>>> We do not have a solution for (2) though. I think we should
>>>> allow the use of a backslash to escape characters that otherwise have
>>>> special meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>> >>>>
>>>> >>>> I think the SQL way of handling this is to require a field name to
>>>> be wrapped in some way when it contains special characters, e.g.
>>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Does anyone have any objection to this proposal, or is there
>>>> anything I'm overlooking? If not, I'm happy to take the task to implement
>>>> the escape character change.
>>>> >>>>>>
>>>> >>>>>> Brian
>>>> >>>>>>
>>>> >>>>>> [1]
>>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>>> >>>>>> [2]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>>>
>>>
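
For comparison with the quoting approach, the backslash escape from the
original proposal in this thread can be sketched in Python as below. The
functions split_escaped and escape_name are hypothetical names, not part of
FieldAccessDescriptor, and treating the backslash itself as escapable is an
assumption added so that every name round-trips.

```python
# Characters the thread lists as special (.[]{}*), plus the backslash itself.
SPECIAL = set(".[]{}*\\")


def split_escaped(spec):
    """Split on unescaped dots; a backslash makes the next character literal."""
    names, current, i = [], [], 0
    while i < len(spec):
        ch = spec[i]
        if ch == "\\":
            if i + 1 >= len(spec):
                raise ValueError("dangling backslash in %r" % spec)
            current.append(spec[i + 1])  # keep the escaped character as-is
            i += 2
            continue
        if ch == ".":
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    names.append("".join(current))
    return names


def escape_name(name):
    """Prefix every special character in a field name with a backslash."""
    return "".join("\\" + c if c in SPECIAL else c for c in name)
```

Under these assumptions, the input parent\.child refers to a single field
named parent.child, while parent.child without the backslash is a nested
access.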
