Why do you need to read it every 6 minutes? Why not just read it as it arrives? If it naturally arrives in 6-minute bursts, you'll read it in 6-minute bursts, no?
Perhaps the data does not have timestamps embedded in it, and that is why you are relying on time-based topic names? In that case I would have an intermediate stage that tags the data with the timestamp and writes it to a single topic, and then a third stage that processes it at your leisure. [A sketch of such a tagging stage appears after the quoted thread below.] Perhaps I am still missing a key difficulty with your system.

Your original suggestion is going to be difficult to get working. You'll quickly run out of file descriptors, amongst other issues.

Philip

---------------------------------
http://www.philipotoole.com

> On Aug 11, 2014, at 6:42 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>
> "And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up." That is what I intend to do for each 6-minute topic.
>
> What I really need is a partitioned queue: each 6 minutes of data can go into a separate partition, so that I can read that specific partition at the end of each 6 minutes. So apparently Redis naturally fits this case, but the only issue is the performance (and also some tricks needed to ensure reliable message delivery). As I said, we have Kafka infrastructure in place, so if I can make the design work with Kafka without too much work, I would rather go this path instead of setting up another queue system.
>
> Chen
>
>
> On Mon, Aug 11, 2014 at 6:07 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
>
>> It's still not clear to me why you need to create so many topics.
>>
>> Write the data to a single topic and consume it when it arrives. It doesn't matter if it arrives in bursts, as long as you can process it all within 6 minutes, right?
>>
>> And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up. The fact that you are thinking about so many topics is a sign your design is wrong, or Kafka is the wrong solution.
>>
>> Philip
>>
>>> On Aug 11, 2014, at 5:18 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>>>
>>> Philip,
>>> That is right. There is a huge amount of data flushed into the topic within each 6 minutes. Then at the end of each 6 minutes, I only want to read from that specific topic, and the data within that topic has to be processed as fast as possible. I was originally using a Redis queue for this purpose, but it takes much longer to process a Redis queue than a Kafka queue (testing data is 2M messages). Since we already have Kafka infrastructure set up, instead of seeking other tools (ActiveMQ, RabbitMQ, etc.), I would rather make use of Kafka, although it does not seem like a common Kafka use case.
>>>
>>> Chen
>>>
>>>
>>>> On Mon, Aug 11, 2014 at 5:01 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
>>>> I'd love to know more about what you're trying to do here. It sounds like you're trying to create topics on a schedule, to make it easy to locate data for a given time range? I'm not sure it makes sense to use Kafka in this manner.
>>>>
>>>> Can you provide more detail?
>>>>
>>>> Philip
>>>>
>>>> -----------------------------------------
>>>> http://www.philipotoole.com
>>>>
>>>>
>>>> On Monday, August 11, 2014 4:45 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>>>>
>>>> Todd,
>>>> I actually only intend to keep each topic valid for 3 days at most. Each of our topics has 3 partitions, so that is around 3 * 240 * 3 = 2,160 partitions. Since there is no API for deleting a topic, I guess I could set up a cron job deleting the outdated topics (folders) from ZooKeeper. Do you know when the delete-topic API will be available in Kafka?
>>>> Chen
>>>>
>>>>
>>>> On Mon, Aug 11, 2014 at 3:47 PM, Todd Palino <tpal...@linkedin.com.invalid> wrote:
>>>>
>>>>> You need to consider your total partition count as you do this. After 30 days, assuming 1 partition per topic, you have 7,200 partitions. Depending on how many brokers you have, this can start to be a problem. We just found an issue on one of our clusters that has over 70k partitions: there is now a problem with doing actions like a preferred replica election for all topics, because the JSON object that gets written to the ZooKeeper node to trigger it is too large for ZooKeeper's default 1 MB data size.
>>>>>
>>>>> You also need to think about the number of open file handles. Even with no data, there will be open files for each topic.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>> On 8/11/14, 2:19 PM, "Chen Wang" <chen.apache.s...@gmail.com> wrote:
>>>>>>
>>>>>> Folks,
>>>>>> Is there any potential issue with creating 240 topics every day? Although the retention of each topic is set to 2 days, I am a little concerned that, since right now there is no delete-topic API, ZooKeeper might be overloaded.
>>>>>> Thanks,
>>>>>> Chen
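
[Editor's note] For illustration, here is a minimal sketch of the intermediate stage Philip describes (stamp each record with its arrival time and write everything to a single topic), combined with Chen's idea of mapping each 6-minute window to its own partition. None of this code is from the thread: the topic name "events", the broker address, the 240-partition layout, and the tab-separated timestamp tag are all assumptions, and it uses the newer Java producer client (org.apache.kafka.clients.producer), which arrived after the 0.8-era clients being discussed here.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TaggingStage {

        // Assumption: one partition per 6-minute window of the day (24 * 60 / 6 = 240).
        private static final int WINDOWS_PER_DAY = 240;

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                String payload = "raw event without an embedded timestamp"; // example input
                long nowMillis = System.currentTimeMillis();

                // Tag the record with its arrival time so downstream stages no longer
                // need time-based topic names to know which window a record belongs to.
                String tagged = nowMillis + "\t" + payload;

                // Map the arrival time to the partition for its 6-minute window,
                // e.g. 00:00-00:06 -> partition 0, 00:06-00:12 -> partition 1, ...
                long minuteOfDay = (nowMillis / 60_000L) % (24 * 60);
                int partition = (int) (minuteOfDay / 6) % WINDOWS_PER_DAY;

                // Assumes a single topic ("events") created with 240 partitions.
                producer.send(new ProducerRecord<>("events", partition,
                        Long.toString(nowMillis), tagged));
            }
        }
    }

With this layout the 240 topics per day collapse into one topic, and the "which window is this?" question moves into the record itself and its partition number.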
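
[Editor's note] On the consuming side, Chen's "read that specific partition at the end of each 6 minutes" could then become a consumer that assigns itself only the partition for the window that just closed. Again a sketch under the same assumptions (topic "events", 240 partitions) using the newer Java consumer client; the group id, the hard-coded window number, and the single poll are placeholders, and a real reader would poll in a loop until it has drained the window.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class WindowReader {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("group.id", "window-reader");         // placeholder group
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            // The 6-minute window that just closed; in practice computed from the
            // clock the same way the producer did. Hard-coded here for illustration.
            int closedWindow = 42;

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Read only the partition holding the closed window's data,
                // bypassing consumer-group rebalancing entirely.
                TopicPartition tp = new TopicPartition("events", closedWindow);
                consumer.assign(Collections.singletonList(tp));

                // A single poll for brevity; loop until caught up in real use.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Timestamp tag and payload were joined with a tab by the producer.
                    String[] parts = record.value().split("\t", 2);
                    System.out.println("ts=" + parts[0] + " payload=" + parts[1]);
                }
                consumer.commitSync();
            }
        }
    }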