Thanks for the feedback. On Wed, Jun 8, 2016 at 12:35 PM, Jesse Anderson <[email protected]> wrote:
> The second thing is about the JavaDoc on the read method. I think the > JavaDocs should talk about the differences between this and a default > KafkaConsumer. Since there isn't a group.id required to be set, I think > the JavaDoc should mention the implications of this. It would say something > about always starting from the latest data. Also, it should mention that > offsets won't be kept and restarting the processing will not start at the > last offset; it will start at the latest data again. > JavaDoc currently states this <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L156> about starting from latest offset : When the pipeline starts for the first time without any checkpoint, the source starts consuming from the <em>latest</em> offsets. You can override this behavior to consume from the beginning by setting appropriate appropriate properties in {@link ConsumerConfig}, through {@link Read#updateConsumerProperties(Map)} It says by default it reads from the latest position. I kind of left very Kafka specific details (e.g. actual Kafka configuration you need to set to start from the beginning, or to enable kafka auto_commit which would provide an (external) soft checkpoint of read input location). KafkaIO gives full control on configuration you can pass to KafkaConsumer (see "Advanced Kafka Configuration <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L198>" section in JavaDoc). I didn't explicitly mention implications of 'starting a job' vs 'updating a job', which is a Beam/Dataflow feature common to all sources. Update starts from checkpoint (ensures KakfaIO resumes reading from where it left off), but fresh start does not have any checkpointed state, so KafkaIO reads from latest offset (by default). It would be better to call this out explicitly as many new users may not aware of it. Please suggest any other improvements.
