Can someone take a quick look at https://github.com/apache/beam/pull/27202?
If things look good, let's try getting it in before the release cut as I'm
also updating our cross-language documentation and would like to include
these changes.

Thank you,
Ahmed

On Thu, Jun 22, 2023 at 8:06 PM Reuven Lax <re...@google.com> wrote:

> The goal was to make schema transforms as efficient as hand-written
> coders. We found the avro encoding/decoding to often be quite inefficient,
> which is one reason we didn't use it for schemas.
>
> Our schema encoding is internal to Beam though, and not suggested for use
> external to a pipeline. For sources or sinks we still recommend using Avro
> (or proto).
>
> On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <
>> ahmedabuals...@google.com> wrote:
>>
>>> Thank you all for your input. I have a PR for the changes I mentioned in
>>> my initial email: https://github.com/apache/beam/pull/27202. Please
>>> review when you get a chance!
>>>
>>> > perhaps we should consider just going to something Avro for portable
>>> coding rather than something custom
>>>
>>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>>> elements? We would still run into the problem Cham mentioned earlier (of
>>> making sure existing PTransform inputs/outputs are compatible with
>>> cross-language-valid types).
>>>
>>
>> I don't remember why Avro was rejected in favor of our own encoding
>> format, but it probably doesn't make sense to revisit that without
>> understanding the full history. I do know that avro versioning and diamond
>> dependencies cause a lot of pain for users and there's a concerted effort
>> to remove Avro from Beam core altogether.
>>
>> In any case, this is quite orthogonal to the proposal here which we
>> should move forward on.
>>
>>
>>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Sure, I get that… though perhaps we should consider just going to
>>>> something Avro for portable coding rather than something custom.
>>>>
>>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>> Input/output PCollection types at least have to be portable Beam types
>>>>> [1] for cross-language to work.
>>>>>
>>>>> I think we restricted schema-aware transforms to PCollection<Row>
>>>>> since Row was expected to be an efficient replacement for arbitrary
>>>>> portable Beam types (not sure how true that is in practice currently).
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>>
>>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <byronel...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Is it actually necessary for a PTransform that is configured via the
>>>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as 
>>>>>> two
>>>>>> separate concerns and unnecessarily limiting.
>>>>>>
>>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>>>> chamik...@google.com> wrote:
>>>>>>
>>>>>>> +1 for the simplification.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <
>>>>>>> rober...@google.com> wrote:
>>>>>>>
>>>>>>>> Yeah. Essentially one needs do (1) name the arguments and (2)
>>>>>>>> implement the transform. Hopefully (1) could be done in a concise way 
>>>>>>>> that
>>>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>>>
>>>>>>>
>>>>>>> +1 but I think the hard part today is to convert existing
>>>>>>> PTransforms to be schema-aware transform compatible (for example, change
>>>>>>> input/output types and make sure parameters take Beam Schema compatible
>>>>>>> types). But this makes sense for new transforms.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <byronel...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>>>>> auto-generate the associated builder on the PTransform? Either way, 
>>>>>>>>> more
>>>>>>>>> DRY.
>>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> +1 to this simplification, it's a historical artifact that
>>>>>>>>>> provides no value.
>>>>>>>>>>
>>>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes 
>>>>>>>>>> a
>>>>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>>>>> ideally with something as simple as adding a decorator to the 
>>>>>>>>>> PTransform
>>>>>>>>>> itself.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>
>>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>>> service. From what I see, there may be a redundant step in 
>>>>>>>>>>> developing
>>>>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>>> - A configuration object
>>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>>
>>>>>>>>>>> The API is generally used like this:
>>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>>>> returns a SchemaTransform
>>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according
>>>>>>>>>>> to the configuration
>>>>>>>>>>>
>>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>>>> can combine the two steps if we have SchemaTransformProvider return 
>>>>>>>>>>> the
>>>>>>>>>>> PTransform directly.
>>>>>>>>>>>
>>>>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in 
>>>>>>>>>>> our API is
>>>>>>>>>>> here [3], and that can be simplified if we make this change (we'd 
>>>>>>>>>>> just trim
>>>>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)`
>>>>>>>>>>> will directly return the PTransform).
>>>>>>>>>>>
>>>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>>>> process of this API so I may be missing some information on why it 
>>>>>>>>>>> was set
>>>>>>>>>>> up this way.
>>>>>>>>>>>
>>>>>>>>>>> A few developers already raised questions about how there's
>>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java 
>>>>>>>>>>> transform
>>>>>>>>>>> portable. I wasn't involved in the design process of this API so I 
>>>>>>>>>>> may be
>>>>>>>>>>> missing some information, but my assumption is this was designed to 
>>>>>>>>>>> follow
>>>>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with
>>>>>>>>>>> the newer SchemaTransformProvider API, we dropped a few methods and 
>>>>>>>>>>> reduced
>>>>>>>>>>> the SchemaTransform class to have a generic buildTransform() 
>>>>>>>>>>> method. See
>>>>>>>>>>> the example of PubsubReadSchemaTransformProvider [6], where the
>>>>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>>>>
>>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>>> contribution to cross-language use cases, we should make it simpler 
>>>>>>>>>>> and
>>>>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>>>>
>>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>>> applying these changes to them should be straightforward. If people 
>>>>>>>>>>> think
>>>>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ahmed
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>>> [2]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>>> [3]
>>>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>>> [4]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>>> [5]
>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>>> [6]
>>>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>>>>
>>>>>>>>>>

Reply via email to