[ https://issues.apache.org/jira/browse/KAFKA-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722635#comment-16722635 ]
Richard Yu edited comment on KAFKA-7432 at 12/17/18 1:59 AM:
-------------------------------------------------------------

Hi, just want to point out something here. What Kafka currently supports is continuous processing, which Spark Streaming has only recently implemented. In contrast, what this ticket suggests implementing is microbatch processing, in which data is sent in batches. In some data streaming circles, continuous processing is considered the best option for sending data; microbatching is the older technique. I don't know if we need to implement this particular option, especially since the overall latency of microbatching is higher than that of continuous processing. Spark is moving from microbatch to continuous processing largely because of the latency improvements. So with what Kafka has right now, this ticket probably isn't necessary.

> API Method on Kafka Streams for processing chunks/batches of data
> -----------------------------------------------------------------
>
>                 Key: KAFKA-7432
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7432
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: sam
>            Priority: Major
>
> For many situations in Big Data it is preferable to work with a small buffer of records at a go, rather than one record at a time.
> The natural example is calling some external API that supports batching for efficiency.
> How can we do this in Kafka Streams? I cannot find anything in the API that looks like what I want.
> So far I have:
> {{builder.stream[String, String]("my-input-topic").mapValues(externalApiCall).to("my-output-topic")}}
> What I want is:
> {{builder.stream[String, String]("my-input-topic").batched(chunkSize = 2000).map(externalBatchedApiCall).to("my-output-topic")}}
> In Scala and Akka Streams the function is called {{grouped}} or {{batch}}. In Spark Structured Streaming we can do {{mapPartitions.map(_.grouped(2000).map(externalBatchedApiCall))}}.
>
> https://stackoverflow.com/questions/52366623/how-to-process-data-in-chunks-batches-with-kafka-streams
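For reference, the batching the reporter asks for can already be approximated with the low-level Processor API, e.g. via {{KStream#transform}}. The sketch below buffers records and flushes a chunk either when it is full or on a wall-clock punctuation. It is only a sketch under stated assumptions: {{externalBatchedApiCall}} is the hypothetical batch API from the description (assumed to return one result per input, in input order), the buffer is in-memory only (unflushed records may be reprocessed after a failure; a state store would be needed for stronger guarantees), and the {{Duration}} overload of {{schedule}} requires Kafka 2.1+.

{code:scala}
import java.time.Duration
import scala.collection.mutable.ArrayBuffer

import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.Transformer
import org.apache.kafka.streams.processor.{ProcessorContext, PunctuationType, Punctuator}

// Buffers up to `chunkSize` records and forwards the batched results
// downstream. `externalBatchedApiCall` is the hypothetical batch API from
// the issue description; it is assumed to return one result per input,
// in input order.
class BatchingTransformer(chunkSize: Int,
                          externalBatchedApiCall: Seq[String] => Seq[String])
    extends Transformer[String, String, KeyValue[String, String]] {

  private var context: ProcessorContext = _
  // In-memory only: a state store would be needed for fault tolerance.
  private val buffer = ArrayBuffer.empty[(String, String)]

  override def init(ctx: ProcessorContext): Unit = {
    context = ctx
    // Flush periodically so a slow input topic cannot hold a partial
    // chunk back indefinitely (the Duration overload needs Kafka 2.1+).
    ctx.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME,
      new Punctuator {
        override def punctuate(timestamp: Long): Unit = flush()
      })
  }

  // Buffer each record and emit nothing here; results are forwarded
  // from flush() instead.
  override def transform(key: String, value: String): KeyValue[String, String] = {
    buffer += ((key, value))
    if (buffer.size >= chunkSize) flush()
    null
  }

  private def flush(): Unit =
    if (buffer.nonEmpty) {
      val results = externalBatchedApiCall(buffer.map(_._2).toSeq)
      buffer.map(_._1).zip(results).foreach {
        case (k, v) => context.forward(k, v)
      }
      buffer.clear()
    }

  override def close(): Unit = ()
}
{code}

Wired into the reporter's example topology, it would look roughly like:

{code:scala}
builder.stream[String, String]("my-input-topic")
  .transform(() => new BatchingTransformer(2000, externalBatchedApiCall))
  .to("my-output-topic")
{code}

Compared with the requested {{batched(chunkSize = 2000)}} DSL method, this keeps the batching entirely inside one processor node, so the rest of the topology still sees one record at a time.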