For completeness, here's another proposal.

We impose limits on Beam field names, and have automatic ways of escaping
or translating characters that don't match. When the Beam field name does
not match the field name in other systems, we use field Options to store
the "original" name so it is not lost. That way we don't have to rely on
the field names always being textually identical.

Downside here: any time we automatically munge a field name, we make select
statements a bit more awkward, as the user has to put the munged field name
into the select.
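
To make the idea concrete, here is a rough Python sketch. It is not Beam code:
the allowed-name pattern, the `_xHHHH_` escape format, the `Field` class, and
the `original_name` option key are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical limit on field names: a letter or underscore first,
# then letters, digits, or underscores.
ALLOWED = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


@dataclass
class Field:
    """Stand-in for a schema field: a name plus free-form options."""
    name: str
    options: dict = field(default_factory=dict)


def munge(original: str) -> Field:
    """Return a field whose name fits ALLOWED, keeping the original in options."""
    if ALLOWED.fullmatch(original):
        # Name already conforms, so nothing needs to be recorded.
        return Field(original)
    # Replace every disallowed character with _xHHHH_ (its code point in hex).
    escaped = re.sub(
        r"[^A-Za-z0-9_]",
        lambda m: "_x%04X_" % ord(m.group()),
        original,
    )
    # A leading digit (or an empty name) still violates ALLOWED; prefix it.
    if not re.match(r"[A-Za-z_]", escaped):
        escaped = "_" + escaped
    # NOTE: this sketch does not guard against collisions with a field that
    # is literally named like an escaped one.
    return Field(escaped, {"original_name": original})


fields = [munge(n) for n in ["user_id", "parent.child", "$col1"]]
# The names a user would have to type into a select:
select_names = [f.name for f in fields]
```

Here `parent.child` ends up as `parent_x002E_child`, which is the name the user
would have to write in a select, while the option still carries `parent.child`
for mapping back to the other system.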

Reuven

On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <[email protected]> wrote:

>
>
> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]> wrote:
>
>>
>>
>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <[email protected]>
>> wrote:
>>
>>> In Beam schemas we don't seem to have a well-defined policy around
>>> special characters (like $.[]) in field names. There's never any explicit
>>> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
>>> more natural . when concatenating field names in a nested select [1]).
>>>
>>> I think we should explicitly allow any special character (any valid
>>> UTF-8 character?) to be used in Beam schema field names. But in order to do
>>> this we will need to provide solutions for some edge cases. To my knowledge
>>> there are two problems that arise with some special characters in field
>>> names:
>>>
>>> 1. They can't be mapped to language types (e.g. Java Classes, and
>>> NamedTuples in Python).
>>>
>>
>> We already have this problem, e.g. if you name a schema field int or any
>> other reserved word. We should disambiguate.
>>
> True, but as I point out below we have ways to deal with this problem. (2)
> is really the problem we need to solve.
>
>>
>>
>>> 2. It can make field accesses ambiguous (i.e. does
>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>> with that exact name or a nested field?).
>>>
>>
>> I still think that we should reserve _some_ special characters. I'm not
>> sure what the use is for allowing any character to be used.
>>
> The use would be ensuring that we don't run into compatibility issues when
> mapping schemas from other systems that have made different choices about
> which characters are special.
>
>>
>>
>>> We already have some precedent for (1) - Beam SQL produces field names
>>> like `$col1` for unaliased fields in query outputs, and this is allowed. If
>>> a user wants to map a schema with a field like this to a POJO, they have to
>>> first rename the incompatible field(s), or use an @SchemaFieldName
>>> annotation to map the field name. I think these are reasonable solutions.
>>>
>>> We do not have a solution for (2) though. I think we should allow the
>>> use of a backslash to escape characters that otherwise have special meaning
>>> for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>
> I think the SQL way of handling this is to require a field name to be
> wrapped in some way when it contains special characters, e.g.
> "`some.parent.field`.`some.child.field`". We could consider that as well.
>
>>
>>> Does anyone have any objection to this proposal, or is there anything
>>> I'm overlooking? If not, I'm happy to take the task to implement the escape
>>> character change.
>>>
>>> Brian
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>>
>>
