alxp1982 commented on code in PR #25301: URL: https://github.com/apache/beam/pull/25301#discussion_r1152509790
########## learning/tour-of-beam/learning-content/IO/kafka-io/kafka-read/description.md: ##########
@@ -0,0 +1,72 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+### Reading from Kafka using KafkaIO
+
+`KafkaIO` is a part of the Apache Beam SDK that provides a way to read data from Apache Kafka and write data to it. It allows for the creation of Beam pipelines that can consume data from a Kafka topic, process the data and write the processed data back to another Kafka topic. This makes it possible to build data processing pipelines using Apache Beam that can easily integrate with a Kafka-based data architecture.
+
+When reading data from Kafka topics using Apache Beam, developers can use the `ReadFromKafka` transform to create a `PCollection` of Kafka messages. This transform takes the following parameters:
+
+When the `ReadFromKafka` transform is executed, it creates a `PCollection` of Kafka messages, where each message is represented as a tuple containing the key, value, and metadata fields. If the with_metadata flag is set to True, the metadata fields are included in the tuple as well.
+
+Developers can then use other Apache Beam transforms to process and analyze the Kafka messages, such as filtering, aggregating, and joining them with other data sources. Once the data processing pipeline is defined, it can be executed on a distributed processing engine, such as **Apache Flink**, **Apache Spark**, or **Google Cloud Dataflow**, to process the Kafka messages in parallel and at scale.
+

Review Comment:
   Move below parameters description up after 'This transform takes the following parameters'



########## learning/tour-of-beam/learning-content/IO/kafka-io/kafka-write/description.md: ##########
@@ -13,8 +13,17 @@
 limitations under the License.
 -->
 ### Writing to Kafka using KafkaIO
 
-`KafkaIO` is a part of the Apache Beam SDK that provides a way to read data from Apache Kafka and write data to it. It allows for the creation of Beam pipelines that can consume data from a Kafka topic, process the data and write the processed data back to another Kafka topic. This makes it possible to build data processing pipelines using Apache Beam that can easily integrate with a Kafka-based data architecture.
+When writing data processing pipelines using Apache Beam, developers can use the `WriteToKafka` transform to write data to Kafka topics. This transform takes a PCollection of data as input and writes the data to a specified Kafka topic using a Kafka producer.
+To use the `WriteToKafka` transform, developers need to provide the following parameters:
+
+* **producer_config**: a dictionary that contains the Kafka producer configuration properties, such as the Kafka broker addresses and the number of acknowledgments to wait for before considering a message as sent.
+* **bootstrap.servers**: is a configuration property in Apache Kafka that specifies the list of bootstrap servers that the Kafka clients should use to connect to the Kafka cluster.
+* **topic**: the name of the Kafka topic to write the data to.
+* **key**: a function that takes an element from the input PCollection and returns the key to use for the Kafka message. The key is optional and can be None.
+* **value**: a function that takes an element from the input PCollection and returns the value to use for the Kafka message.
+
+When writing data to Kafka using Apache Beam, it is important to ensure that the pipeline is fault-tolerant and can handle failures, such as network errors, broker failures, or message serialization errors. Apache Beam provides features such as checkpointing, retries, and dead-letter queues to help developers build robust and reliable data processing pipelines that can handle these types of failures.

Review Comment:
   Perhaps let's add a link to Beam documentation where more information can be found about these topics.
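For context, here is a minimal sketch of how the read and write transforms discussed above can be wired together with the Beam Python SDK. The broker address, topic names, and the upper-casing "Process" step are hypothetical and purely illustrative; keys and values are treated as raw bytes, and because `ReadFromKafka`/`WriteToKafka` are cross-language transforms, running this also requires a compatible portable runner with a Java expansion service available.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical broker address and topic names, used only for illustration.
BOOTSTRAP_SERVERS = "localhost:9092"


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Consume (key, value) byte pairs from the input topic.
            | "ReadFromKafka" >> ReadFromKafka(
                consumer_config={"bootstrap.servers": BOOTSTRAP_SERVERS},
                topics=["input-topic"],
            )
            # A trivial processing step: upper-case each message value.
            # Keys may be absent, so fall back to an empty byte string.
            | "Process" >> beam.Map(
                lambda kv: (kv[0] or b"", kv[1].decode("utf-8").upper().encode("utf-8"))
            )
            # Write the processed (key, value) pairs back to another topic.
            | "WriteToKafka" >> WriteToKafka(
                producer_config={"bootstrap.servers": BOOTSTRAP_SERVERS},
                topic="output-topic",
            )
        )


if __name__ == "__main__":
    run()
```

Because Kafka topics are unbounded sources, the read step yields an unbounded `PCollection`, so a sketch like this would typically be executed on a streaming-capable runner such as Flink, Spark, or Dataflow.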
########## learning/tour-of-beam/learning-content/IO/kafka-io/kafka-read/description.md: ##########
@@ -0,0 +1,72 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+### Reading from Kafka using KafkaIO
+
+`KafkaIO` is a part of the Apache Beam SDK that provides a way to read data from Apache Kafka and write data to it. It allows for the creation of Beam pipelines that can consume data from a Kafka topic, process the data and write the processed data back to another Kafka topic. This makes it possible to build data processing pipelines using Apache Beam that can easily integrate with a Kafka-based data architecture.
+
+When reading data from Kafka topics using Apache Beam, developers can use the `ReadFromKafka` transform to create a `PCollection` of Kafka messages. This transform takes the following parameters:
+
+When the `ReadFromKafka` transform is executed, it creates a `PCollection` of Kafka messages, where each message is represented as a tuple containing the key, value, and metadata fields. If the with_metadata flag is set to True, the metadata fields are included in the tuple as well.

Review Comment:
   `ReadFromKafka` transform returns unbounded `PCollection` of Kafka messages, where each element contains the key, value, and basic metadata such as topic-partition and offset.



########## learning/tour-of-beam/learning-content/IO/text-io/text-io-gcs-write/description.md: ##########
@@ -80,4 +80,8 @@ It is important to note that in order to interact with **GCS** you will need to
 ```
 --tempLocation=gs://my-bucket/temp
 ```
-{{end}}
\ No newline at end of file
+{{end}}
+
+### Playground exercise
+
+You can write PCollection to a file. Use DoFn to generate numbers and write them down by filtering.

Review Comment:
   Please follow the pattern e.g. first give a small description of what the playground example is doing and then give a challenge
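As one possible reading of that playground exercise, here is a small sketch in the Beam Python SDK: a `DoFn` generates numbers, a filter keeps the even ones, and the result is written with `WriteToText`. The output prefix "numbers" is a placeholder; a `gs://` prefix would work the same way given the credentials and `--tempLocation` setup described in the file above.

```python
import apache_beam as beam


class GenerateNumbers(beam.DoFn):
    """Emit the integers 0..99, regardless of the input element."""

    def process(self, element):
        for i in range(100):
            yield i


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Seed" >> beam.Create([None])  # a single element to trigger the DoFn
        | "Generate" >> beam.ParDo(GenerateNumbers())
        | "KeepEven" >> beam.Filter(lambda n: n % 2 == 0)
        | "Format" >> beam.Map(str)
        # "numbers" is a placeholder output prefix; shards get the ".txt" suffix.
        | "Write" >> beam.io.WriteToText("numbers", file_name_suffix=".txt")
    )
```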
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]