Re: Special characters in Beam Schema field names

Robert Bradshaw Wed, 18 Mar 2020 17:10:25 -0700

Given the flexibility of SQL, and the diversity of upstream systems,
I'd lean toward being maximally flexible and saying a field name is
a UTF-8 string (including whitespace?), but special characters may
require quoting and/or rule out some conveniences (e.g. POJO
creation).
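
To make the quoting idea concrete, here is a minimal sketch in Python. It
is not Beam's actual FieldAccessDescriptor parser: the helper name
split_field_path is hypothetical, and accepting both backslash escapes and
backtick quoting is an assumption made for illustration, borrowed from the
two options raised further down in this thread.

```python
from typing import List


def split_field_path(path: str) -> List[str]:
    """Split a dotted field path into individual field names.

    Hypothetical helper, not part of Beam. A backslash escapes the next
    character, and backticks quote a whole segment, so a dot inside
    either is kept as part of the field name instead of splitting.
    """
    names = []
    current = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\":
            # Escaped character: take the next character literally.
            if i + 1 >= len(path):
                raise ValueError("dangling backslash at end of path")
            current.append(path[i + 1])
            i += 2
            continue
        if ch == "`":
            # Toggle quoting; the backtick itself is not part of the name.
            in_quotes = not in_quotes
        elif ch == "." and not in_quotes:
            # Unquoted, unescaped dot: end of one field name.
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated backtick quote")
    names.append("".join(current))
    return names
```

Under these assumptions, "parent.child" splits into two nested names,
while the same text with an escaped dot or wrapped in backticks stays a
single field name, which is the ambiguity discussed in the quoted thread.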


On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <[email protected]> wrote:
>
> Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted) 
> field names to contain any character. So it's currently possible for 
> SqlTransform to produce schemas with field names containing dots and other 
> special characters, which we can't handle properly outside of the SQL 
> context. If we do want to reserve some special characters, I think we should 
> validate that schemas don't contain them, which would limit what you can 
> output with SqlTransform, for better or worse.
>
> > We impose limits on Beam field names, and have automatic ways of escaping 
> > or translating characters that don't match. When the Beam field name does 
> > not match the field name in other systems, we use field Options to store 
> > the "original" name so it is not lost. That way we don't have to rely on 
> > the field names always being textually identical.
>
> A good use of the new Options feature :)
> One of the problems I would like this thread to solve though is the 
> possibility of using schemas and rows for the Options themselves (discussed 
> extensively in Alex's PR [3]). If we use Options to handle special 
> characters, we would need options on the schema of the Options (I think I 
> said that right?) to solve it in that context.
>
> > I'm all for initial strict naming rules, that we can relax as we learn 
> > more. Additional restrictions tend to require major version changes to 
> > accommodate the backwards incompatibility.
>
> I think it may be too late to be strict though, since schemas came from SQL, 
> and both supported SQL dialects are very permissive here. At this point it 
> seems easier to be very permissive within Beam, and provide ways to deal with 
> incompatibilities at the boundaries (e.g. SDKs providing ways to translate 
> fields for language types, raising errors when a schema is incompatible for 
> some IO, etc).
>
> [1] https://calcite.apache.org/docs/reference.html#identifiers
> [2] https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
> [3] https://github.com/apache/beam/pull/10413
>
> On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <[email protected]> wrote:
>>
>> I'm all for initial strict naming rules, that we can relax as we learn more. 
>> Additional restrictions tend to require major version changes to accommodate 
>> the backwards incompatibility.
>>
>> I'd rather the community provide compelling use cases for relaxations than 
>> have us speculate about what could be useful at the outset.
>>
>> That said, it might be a touch late for schema fields...
>>
>> It's definitely my Go bias showing, but a sensible start is to not allow 
>> fields to start with a digit. This matches most C derived languages (which 
>> includes all our SDK languages at present, except maybe for Scio...).
>>
>>
>>
>> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <[email protected]> wrote:
>>>
>>> For completeness, here's another proposal.
>>>
>>> We impose limits on Beam field names, and have automatic ways of escaping 
>>> or translating characters that don't match. When the Beam field name does 
>>> not match the field name in other systems, we use field Options to store 
>>> the "original" name so it is not lost. That way we don't have to rely on 
>>> the field names always being textually identical.
>>>
>>> Downside here: any time we automatically munge a field name, we make select 
>>> statements a bit more awkward, as the user has to put the munged field name 
>>> into the select.
>>>
>>> Reuven
>>>
>>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <[email protected]> 
>>>>> wrote:
>>>>>>
>>>>>> In Beam schemas we don't seem to have a well-defined policy around 
>>>>>> special characters (like $.[]) in field names. There's never any 
>>>>>> explicit validation, but we do have some ad-hoc rules (e.g. we use _ 
>>>>>> rather than the more natural . when concatenating field names in a 
>>>>>> nested select [1])
>>>>>>
>>>>>> I think we should explicitly allow any special character (any valid 
>>>>>> UTF-8 character?) to be used in Beam schema field names. But in order to 
>>>>>> do this we will need to provide solutions for some edge cases. To my 
>>>>>> knowledge there are two problems that arise with some special characters 
>>>>>> in field names:
>>>>>>
>>>>>> 1. They can't be mapped to language types (e.g. Java classes and 
>>>>>> NamedTuples in Python).
>>>>>
>>>>>
>>>>> We already have this problem, e.g. if you name a schema field `int` or 
>>>>> any other reserved word. We should disambiguate.
>>>>
>>>> True, but as I point out below we have ways to deal with this problem. (2) 
>>>> is really the problem we need to solve.
>>>>>
>>>>>
>>>>>>
>>>>>> 2. It can make field accesses ambiguous (i.e. does 
>>>>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field 
>>>>>> with that exact name or a nested field?).
>>>>>
>>>>>
>>>>> I still think that we should reserve _some_ special characters. I'm not 
>>>>> sure what the use is for allowing any character to be used.
>>>>
>>>> The use would be ensuring that we don't run into compatibility issues when 
>>>> mapping schemas from other systems that have made different choices about 
>>>> which characters are special.
>>>>>
>>>>>
>>>>>>
>>>>>> We already have some precedent for (1) - Beam SQL produces field names 
>>>>>> like `$col1` for unaliased fields in query outputs, and this is allowed. 
>>>>>> If a user wants to map a schema with a field like this to a POJO, they 
>>>>>> have to first rename the incompatible field(s), or use an 
>>>>>> @SchemaFieldName annotation to map the field name. I think these are 
>>>>>> reasonable solutions.
>>>>>>
>>>>>> We do not have a solution for (2) though. I think we should allow the 
>>>>>> use of a backslash to escape characters that otherwise have special 
>>>>>> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>>
>>>> I think the SQL way of handling this is to require a field name to be 
>>>> wrapped in some way when it contains special characters, e.g. 
>>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>>>>>
>>>>>>
>>>>>> Does anyone have any objection to this proposal, or is there anything 
>>>>>> I'm overlooking? If not, I'm happy to take the task to implement the 
>>>>>> escape character change.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> [1] 
>>>>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>>>>> [2] 
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
