Can you explain more about the claim that "current sinks for Avro and
Parquet with the destination of GCS are not supported"?

We do have AvroIO and ParquetIO (
https://beam.apache.org/documentation/io/connectors/) in Python.
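
For reference, here is a minimal sketch of what such a pipeline could look
like in Python. Please treat it as untested: the broker address, topic
name, GCS path, and record schema are placeholders, it assumes a Beam SDK
version where WriteToAvro accepts windowed input in streaming mode, and
the Schema Registry lookup is not shown (a fixed SCHEMA dict stands in
for it):

import apache_beam as beam
from apache_beam.io import WriteToAvro
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Placeholder schema; in practice this would be fetched from the
# Schema Registry rather than hard-coded.
SCHEMA = {
    "type": "record",
    "name": "CdcRecord",
    "fields": [{"name": "payload", "type": "bytes"}],
}

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Cross-language Kafka source; emits (key, value) byte pairs.
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["cdc-topic"])
        # Wrap the raw value bytes into a record matching SCHEMA. Real
        # Avro decoding of the payload would happen in this step.
        | "ToRecord" >> beam.Map(lambda kv: {"payload": kv[1]})
        # File-based sinks need windowed input in streaming mode.
        | "Window" >> beam.WindowInto(FixedWindows(60))
        # Write one set of Avro files to GCS per 60-second window.
        | "WriteAvro" >> WriteToAvro(
            "gs://example-bucket/cdc/output",
            schema=SCHEMA,
            file_name_suffix=".avro")
    )

Whether this runs as-is on Dataflow will depend on the SDK version you
use, so it is a starting point rather than a verified solution.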

On Wed, Mar 13, 2024 at 5:35 PM Ondřej Pánek <ondrej.pa...@bighub.cz> wrote:

> Hello Beam team!
>
>
>
> We’re currently onboarding a customer’s infrastructure to the Google Cloud
> Platform. The decision was made that one of the technologies they will use
> is Dataflow. Let me briefly describe the use case:
>
> They have a Kafka cluster where data from a CDC source is stored. The
> data in the topics is stored in Avro format. Their other requirement is
> that they want a streaming solution reading from these Kafka topics and
> writing to Google Cloud Storage, again in Avro. What’s more, the
> component should be written in Python, since their Data Engineers heavily
> prefer Python over Java.
>
>
>
> We’ve been struggling with the design of the solution for a couple of
> weeks now, and we’re in quite an unfortunate situation, not having found
> any solution that fits these requirements.
>
>
>
> So the question is: Is there any existing Dataflow template/solution with
> the following specifications:
>
>    - Streaming connector
>    - Written in Python
>    - Consumes from Kafka topics
>    - Reads Avro with Schema Registry
>    - Writes Avro to GCS
>
>
>
> We found out that the current sinks for Avro and Parquet with GCS as the
> destination are not supported for Python at the moment, which is basically
> the main blocker now.
>
>
>
> Any recommendations/suggestions would be highly appreciated!
>
>
>
> Maybe the solution really does not exist and we need to create our own
> custom connector for it. The question in that case would be whether that’s
> even theoretically possible, since we would really need to avoid another
> dead end.
>
>
>
> Thanks a lot for any help!
>
>
>
> Kind regards,
>
> Ondrej
>
