Can you explain more about "current sinks for Avro and Parquet with the destination of GCS are not supported"?
We do have AvroIO and ParquetIO (https://beam.apache.org/documentation/io/connectors/) in Python; a rough sketch of a pipeline wiring them together follows the quoted message below.

On Wed, Mar 13, 2024 at 5:35 PM Ondřej Pánek <ondrej.pa...@bighub.cz> wrote:

> Hello Beam team!
>
> We're currently onboarding a customer's infrastructure to the Google Cloud
> Platform. The decision was made that one of the technologies they will use
> is Dataflow. Let me briefly describe the use case: they have a Kafka
> cluster where data from a CDC source is stored. The data in the topics is
> stored in Avro format. Their other requirement is a streaming solution
> that reads from these Kafka topics and writes to Google Cloud Storage,
> again in Avro. What's more, the component should be written in Python,
> since their Data Engineers heavily prefer Python over Java.
>
> We've been struggling with the design of this solution for a couple of
> weeks now, and we're in a rather unfortunate situation, not having found
> any solution that fits these requirements.
>
> So the question is: is there any existing Dataflow template/solution with
> the following specifications:
>
> - Streaming connector
> - Written in Python
> - Consumes from Kafka topics
> - Reads Avro with Schema Registry
> - Writes Avro to GCS
>
> We found out that the current sinks for Avro and Parquet with GCS as the
> destination are not supported for Python at the moment, which is basically
> the main blocker now.
>
> Any recommendations/suggestions would be highly appreciated!
>
> Maybe the solution really does not exist and we need to create our own
> custom connector. The question in that case would be whether that is even
> theoretically possible, since we would really need to avoid another dead
> end.
>
> Thanks a lot for any help!
>
> Kind regards,
>
> Ondrej
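For reference, here is a minimal sketch of what such a pipeline could look like in the Python SDK. It assumes the Kafka values use the Confluent Schema Registry wire format (one magic byte plus a 4-byte schema id before the Avro payload) and that a single record schema is known up front; the broker address, topic name, bucket path, and the CdcEvent schema are all placeholders. Note that ReadFromKafka is a cross-language transform, so a Java expansion service must be available at runtime.

import io

import apache_beam as beam
import fastavro
from apache_beam.io.avroio import WriteToAvro
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Hypothetical record schema; in practice, fetch it from the Schema Registry.
SCHEMA = {
    "type": "record",
    "name": "CdcEvent",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "payload", "type": "string"},
    ],
}
PARSED_SCHEMA = fastavro.parse_schema(SCHEMA)


def decode_confluent_avro(kv):
    """Strip the 5-byte Confluent header and decode the Avro payload.

    Simplification: ignores the embedded schema id and assumes every
    record was written with PARSED_SCHEMA.
    """
    _, value = kv  # ReadFromKafka emits (key, value) byte pairs by default
    buf = io.BytesIO(value[5:])  # skip magic byte + 4-byte schema id
    return fastavro.schemaless_reader(buf, PARSED_SCHEMA)


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["cdc-topic"],
        )
        | "DecodeAvro" >> beam.Map(decode_confluent_avro)
        # Window the unbounded stream so the file sink can finalize output.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "WriteAvro" >> WriteToAvro(
            "gs://my-bucket/output/cdc",
            schema=PARSED_SCHEMA,
            file_name_suffix=".avro",
        )
    )

One caveat: Python's file-based sinks have historically had limited support for unbounded input, so depending on the Beam version you may need to build the write on fileio.WriteToFiles instead of WriteToAvro for streaming output. Treat the above as a starting point to validate against your SDK version, not a production-ready template.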