Hey, Chris: 1. Yes, batch size set to 1: systems.kafka.producer.batch.num.messages=1 2. Building default chooser with: useBatching=false, useBootstrapping=false, usePriority=true 3. I just re-ran my test and the delay is about 16 seconds. When I’m really pumping data through the batch Kafka topics, I’ve seen around 1 minute.
Thanks! —T On Jul 15, 2014, at 10:11 AM, Chris Riccomini <[email protected]> wrote: > Hey TJ, > > This sounds like a bug. > > 1. Is the task.consumer.batch.size config set for your job? > 2. What does this log line say: Building default chooser with: > useBatching=%s, useBootstrapping=%s, usePriority=%s > 3. How long is the delay that you're seeing? > > Cheers, > Chris > > On 7/14/14 4:57 PM, "Yan Fang" <[email protected]> wrote: > >> It seems for me that, the batch processing fetches many messages at one >> time and then takes too long time to process. My first thought is that, >> since we have the systems.system-name.samza.fetch.threshold >> <http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/configur >> ation-table.html> >> , >> setting this number to a smaller (default is 50000) will force the system >> to fetch the messages more frequently. >> >> Any other ideas? >> >> Fang, Yan >> [email protected] >> +1 (206) 849-4108 >> >> >> On Mon, Jul 14, 2014 at 1:12 PM, TJ Giuli <[email protected]> >> wrote: >> >>> Hi, >>> >>> I have a stream processor that takes inputs from multiple streams, some >>> are more batch, non-latency sensitive and others are real-time, >>> infrequently have traffic and should be low-latency. The real-time >>> stream >>> helps me interpret the batch stream, so I would ideally like any >>> real-time >>> stream envelopes delivered within some maximum latency from the time the >>> message enters into a Kafka topic. >>> >>> I have my stream processor configured to prioritize my real-time streams >>> over the batch streams, but I consistently find that the real-time >>> stream >>> is delayed by traffic from the batch stream. From tracing the Kafka >>> consumer, it looks like my stream processor periodically fetches from >>> Kafka, finds that the batch streams have a large chunk of messages >>> waiting, >>> doesn¹t find anything on the real-time topics, and processes away the >>> batch >>> messages for a few minutes. During the batch processing, the Kafka >>> consumer does not poll the real-time streams, so if a message is sent >>> to a >>> real-time topic, the message effectively doesn¹t arrive until the next >>> time >>> the Kafka consumer does another fetch. When a real-time message is >>> consumed by the Kafka consumer, the TieredPriorityChooser correctly >>> prioritizes traffic from the real-time streams over the batch streams. >>> >>> Does anyone have recommendations on how to get infrequent but important >>> messages within some maximum time bound? Thanks! >>> ‹T >
