Re: Special characters in Beam Schema field names

Brian Hulette Thu, 19 Mar 2020 17:06:24 -0700

I'm +1 on using the SQL (quoting) convention to handle special characters
when inputting a field name, rather than an escape character.
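
As a rough sketch of what the quoting convention could look like: the
snippet below splits a dotted path into field names while honoring
backticks, and "unparses" a list of names back into a printable path.
Everything here is hypothetical (the helper names, the set of special
characters taken from the .[]{}* list in the original proposal, and the
choice to write a literal backtick as a doubled backtick); it is not the
existing FieldAccessDescriptor parser, and Python is used only for brevity.

```python
from typing import List

# Characters assumed to need quoting: the FieldAccessDescriptor specials
# mentioned in the thread (.[]{}*) plus the backtick itself.
SPECIAL = set(".[]{}*`")


def parse_field_path(path: str) -> List[str]:
    """Split a dotted path into field names, honoring backtick quoting."""
    names: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "`":
            if in_quotes and i + 1 < len(path) and path[i + 1] == "`":
                # Assumption: a doubled backtick inside quotes is a literal one.
                current.append("`")
                i += 1
            else:
                in_quotes = not in_quotes
        elif ch == "." and not in_quotes:
            # An unquoted dot separates two field names.
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError(f"unterminated backtick in {path!r}")
    names.append("".join(current))
    return names


def unparse_field_path(names: List[str]) -> str:
    """Render field names as a path, quoting any name that needs it."""

    def quote(name: str) -> str:
        if name == "" or any(c in SPECIAL for c in name):
            return "`" + name.replace("`", "``") + "`"
        return name

    return ".".join(quote(n) for n in names)


print(parse_field_path("parent.child"))
print(parse_field_path("`parent.child`"))
print(unparse_field_path(["some.parent.field", "child"]))
```

With this shape, "parent.child" reads as a two-element path, while
"`parent.child`" reads as a single field whose name contains a dot, and
printing always goes through the unparse step so the two stay
distinguishable.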


On Thu, Mar 19, 2020 at 2:24 PM Reuven Lax <[email protected]> wrote:

> This sounds fine. We'd have to make our parser for Select clauses be a bit
> smarter, but it shouldn't be too difficult to extend the grammar to handle
> escape characters.
>
> On Wed, Mar 18, 2020 at 8:01 PM Kenneth Knowles <[email protected]> wrote:
>
>> I favor allowing field names to contain any Unicode character,
>> semantically. I do not think encoding is a semantic property of a field
>> name (or even a string in a particular programming language) so UTF-8
>> doesn't need to be part of it. Inputting a field name in a particular
>> context is separable from what characters can occur in the name, and the
>> encoding of a string when it is turned into bytes is orthogonal to what
>> characters are in the string.
>>
>> SQL has a good convention to allow any character (backticks, as you
>> demonstrated), as do most Unix shells / filesystems. Note again that
>> backtick and backslash conventions are how to _input_ a field name, not the
>> characters actually in the field name. Your example of "parent.child" is a
>> good one, too: the dot is not part of any field name, but just a way to
>> input a list of names to construct a path. And your later example of using
>> backticks around the dot works perfectly if you want a dot in the field
>> name. This is a solved problem IMO, and we just have to take a solution off
>> the shelf.
>>
>> Since schemas are pretty closely related to SQL, how about just using
>> these particular SQL conventions? I like backticks and I also like
>> backslashes.
>>
>> For debuggability, we need to always print a properly unparsed
>> identifier, not just print the field name as a string. So in the example of
>> "we use _ rather than the more natural . when concatenating field names in
>> a nested select" I would prefer to just use a dot, for clarity, and when
>> printing it the position of the backticks will make it totally clear that
>> the dot is not a field separator.
>>
>> Kenn
>>
>> On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> Given the flexibility of SQL, and the diversity of upstream systems,
>>> I'd lean on the side of being maximally flexible and saying a field
>>> name is a UTF-8 string (including whitespace?), but special characters
>>> may require quoting and/or not allow some convenience (e.g. POJO
>>> creation).
>>>
>>> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <[email protected]>
>>> wrote:
>>> >
>>> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow
>>> (quoted) field names to contain any character. So it's currently possible
>>> for SqlTransform to produce schemas with field names containing dots and
>>> other special characters, which we can't handle properly outside of the SQL
>>> context. If we do want to have some special characters, I think we should
>>> validate that schemas don't contain them, which would limit what you can
>>> output with SqlTransform, for better or worse.
>>> >
>>> > > We impose limits on Beam field names, and have automatic ways of
>>> escaping or translating characters that don't match. When the Beam field
>>> name does not match the field name in other systems, we use field Options
>>> to store the "original" name so it is not lost. That way we don't have to
>>> rely on the field names always being textually identical.
>>> >
>>> > A good use of the new Options feature :)
>>> > One of the problems I would like this thread to solve though is the
>>> possibility of using schemas and rows for the Options themselves (discussed
>>> extensively in Alex's PR [3]). If we use Options to handle special
>>> characters, we would need options on the schema of the Options (I think I
>>> said that right?) to solve it in that context.
>>> >
>>> > > I'm all for initial strict naming rules, that we can relax as we
>>> learn more. Additional restrictions tend to require major version changes
>>> to accommodate the backwards incompatibility.
>>> >
>>> > I think it may be too late to be strict though, since schemas came
>>> from SQL, and both supported SQL dialects are very permissive here. At this
>>> point it seems easier to be very permissive within Beam, and provide ways
>>> to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
>>> to translate fields for language types, raising errors when a schema is
>>> incompatible for some IO, etc).
>>> >
>>> > [1] https://calcite.apache.org/docs/reference.html#identifiers
>>> > [2]
>>> https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
>>> > [3] https://github.com/apache/beam/pull/10413
>>> >
>>> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <[email protected]>
>>> wrote:
>>> >>
>>> >> I'm all for initial strict naming rules, that we can relax as we
>>> learn more. Additional restrictions tend to require major version changes
>>> to accommodate the backwards incompatibility.
>>> >>
>>> >> I'd rather the community provide compelling use cases for relaxations
>>> than us speculating about what could be useful at the outset.
>>> >>
>>> >> That said, it might be a touch late for schema fields...
>>> >>
>>> >> It's definitely my Go bias showing, but a sensible start is to not
>>> allow fields to start with a digit. This matches most C-derived languages
>>> (which includes all our SDK languages at present, except maybe for Scio...).
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <[email protected]> wrote:
>>> >>>
>>> >>> For completeness, here's another proposal.
>>> >>>
>>> >>> We impose limits on Beam field names, and have automatic ways of
>>> escaping or translating characters that don't match. When the Beam field
>>> name does not match the field name in other systems, we use field Options
>>> to store the "original" name so it is not lost. That way we don't have to
>>> rely on the field names always being textually identical.
>>> >>>
>>> >>> Downside here: any time we automatically munge a field name, we make
>>> select statements a bit more awkward, as the user has to put the munged
>>> field name into the select.
>>> >>>
>>> >>> Reuven
>>> >>>
>>> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <[email protected]>
>>> wrote:
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]>
>>> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <
>>> [email protected]> wrote:
>>> >>>>>>
>>> >>>>>> In Beam schemas we don't seem to have a well-defined policy
>>> around special characters (like $.[]) in field names. There's never any
>>> explicit validation, but we do have some ad-hoc rules (e.g. we use _ rather
>>> than the more natural . when concatenating field names in a nested select
>>> [1])
>>> >>>>>>
>>> >>>>>> I think we should explicitly allow any special character (any
>>> valid UTF-8 character?) to be used in Beam schema field names. But in order
>>> to do this we will need to provide solutions for some edge cases. To my
>>> knowledge there are two problems that arise with some special characters in
>>> field names:
>>> >>>>>>
>>> >>>>>> 1. They can't be mapped to language types (e.g. Java classes, and
>>> NamedTuples in Python).
>>> >>>>>
>>> >>>>>
>>> >>>>> We already have this problem - i.e. if you name a schema field to
>>> be int, or any other reserved string. We should disambiguate.
>>> >>>>
>>> >>>> True, but as I point out below we have ways to deal with this
>>> problem. (2) is really the problem we need to solve.
>>> >>>>>
>>> >>>>>
>>> >>>>>>
>>> >>>>>> 2. It can make field accesses ambiguous (i.e. does
>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>> with that exact name or a nested field?).
>>> >>>>>
>>> >>>>>
>>> >>>>> I still think that we should reserve _some_ special characters.
>>> I'm not sure what the use is for allowing any character to be used.
>>> >>>>
>>> >>>> The use would be ensuring that we don't run into compatibility
>>> issues when mapping schemas from other systems that have made different
>>> choices about which characters are special.
>>> >>>>>
>>> >>>>>
>>> >>>>>>
>>> >>>>>> We already have some precedent for (1) - Beam SQL produces field
>>> names like `$col1` for unaliased fields in query outputs, and this is
>>> allowed. If a user wants to map a schema with a field like this to a POJO,
>>> they have to first rename the incompatible field(s), or use an
>>> @SchemaFieldName annotation to map the field name. I think these are
>>> reasonable solutions.
>>> >>>>>>
>>> >>>>>> We do not have a solution for (2) though. I think we should allow
>>> the use of a backslash to escape characters that otherwise have special
>>> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>> >>>>
>>> >>>> I think the SQL way of handling this is to require a field name to
>>> be wrapped in some way when it contains special characters, e.g.
>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Does anyone have any objection to this proposal, or is there
>>> anything I'm overlooking? If not, I'm happy to take the task to implement
>>> the escape character change.
>>> >>>>>>
>>> >>>>>> Brian
>>> >>>>>>
>>> >>>>>> [1]
>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>> >>>>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>>
>>
