Why do you need to read it every 6 minutes? Why not just read it as it arrives? If it naturally arrives in 6-minute bursts, you'll read it in 6-minute bursts, no?
Perhaps the data does not have timestamps embedded in it, and that is why you are relying on time-based topic names? In that case I would have an intermediate stage that tags the data with the timestamp and writes it to a single topic, and then a third stage that processes it at your leisure. [A sketch of such a tagging stage appears after the quoted thread below.] Perhaps I am still missing a key difficulty with your system.

Your original suggestion is going to be difficult to get working. You'll quickly run out of file descriptors, amongst other issues.

Philip

---------------------------------
http://www.philipotoole.com

> On Aug 11, 2014, at 6:42 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>
> "And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up." That is what I intend to do for each 6-minute topic.
>
> What I really need is a partitioned queue: each 6 minutes of data can go into a separate partition, so that I can read that specific partition at the end of each 6 minutes. So apparently Redis naturally fits this case, but the only issue is the performance (and also some tricks needed to ensure reliable message delivery). As I said, we have Kafka infrastructure in place, so if I can make the design work with Kafka without too much work, I would rather go this path instead of setting up another queue system.
>
> Chen
>
>
> On Mon, Aug 11, 2014 at 6:07 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
>
>> It's still not clear to me why you need to create so many topics.
>>
>> Write the data to a single topic and consume it when it arrives. It doesn't matter if it arrives in bursts, as long as you can process it all within 6 minutes, right?
>>
>> And if you can't consume it all within 6 minutes, partition the topic until you can run enough consumers such that you can keep up. The fact that you are thinking about so many topics is a sign your design is wrong, or Kafka is the wrong solution.
>>
>> Philip
>>
>>> On Aug 11, 2014, at 5:18 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>>>
>>> Philip,
>>> That is right. There is a huge amount of data flushed into the topic within each 6 minutes. Then at the end of each 6 minutes, I only want to read from that specific topic, and the data within that topic has to be processed as fast as possible. I was originally using a Redis queue for this purpose, but it takes much longer to process a Redis queue than a Kafka queue (testing data is 2M messages). Since we already have Kafka infrastructure set up, instead of seeking other tools (ActiveMQ, RabbitMQ, etc.), I would rather make use of Kafka, although it does not seem like a common Kafka use case.
>>>
>>> Chen
>>>
>>>
>>>> On Mon, Aug 11, 2014 at 5:01 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
>>>> I'd love to know more about what you're trying to do here. It sounds like you're trying to create topics on a schedule, to make it easy to locate data for a given time range? I'm not sure it makes sense to use Kafka in this manner.
>>>>
>>>> Can you provide more detail?
>>>>
>>>> Philip
>>>>
>>>> -----------------------------------------
>>>> http://www.philipotoole.com
>>>>
>>>>
>>>> On Monday, August 11, 2014 4:45 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
>>>>
>>>> Todd,
>>>> I actually only intend to keep each topic valid for 3 days at most. Each of our topics has 3 partitions, so that is around 3 * 240 * 3 = 2,160 partitions. Since there is no API for deleting a topic, I guess I could set up a cron job deleting the outdated topics (folders) from ZooKeeper. Do you know when the delete-topic API will be available in Kafka?
>>>> Chen
>>>>
>>>>
>>>> On Mon, Aug 11, 2014 at 3:47 PM, Todd Palino <tpal...@linkedin.com.invalid> wrote:
>>>>
>>>>> You need to consider your total partition count as you do this. After 30 days, assuming 1 partition per topic, you have 7,200 partitions. Depending on how many brokers you have, this can start to be a problem. We just found an issue on one of our clusters that has over 70k partitions: there is now a problem with doing actions like a preferred replica election for all topics, because the JSON object that gets written to the ZooKeeper node to trigger it is too large for ZooKeeper's default 1 MB data size.
>>>>>
>>>>> You also need to think about the number of open file handles. Even with no data, there will be open files for each topic.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>> On 8/11/14, 2:19 PM, "Chen Wang" <chen.apache.s...@gmail.com> wrote:
>>>>>>
>>>>>> Folks,
>>>>>> Is there any potential issue with creating 240 topics every day? Although the retention of each topic is set to 2 days, I am a little concerned that, since right now there is no delete-topic API, ZooKeeper might be overloaded.
>>>>>> Thanks,
>>>>>> Chen
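
[Editor's note] For illustration, here is a minimal sketch of the intermediate stage Philip describes (stamp each record with its arrival time and write everything to a single topic), combined with Chen's idea of mapping each 6-minute window to its own partition. None of this code is from the thread: the topic name "events", the broker address, the 240-partition layout, and the tab-separated timestamp tag are all assumptions, and it uses the newer Java producer client (org.apache.kafka.clients.producer), which arrived after the 0.8-era clients being discussed here.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TaggingStage {

        // Assumption: one partition per 6-minute window of the day (24 * 60 / 6 = 240).
        private static final int WINDOWS_PER_DAY = 240;

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                String payload = "raw event without an embedded timestamp"; // example input
                long nowMillis = System.currentTimeMillis();

                // Tag the record with its arrival time so downstream stages no longer
                // need time-based topic names to know which window a record belongs to.
                String tagged = nowMillis + "\t" + payload;

                // Map the arrival time to the partition for its 6-minute window,
                // e.g. 00:00-00:06 -> partition 0, 00:06-00:12 -> partition 1, ...
                long minuteOfDay = (nowMillis / 60_000L) % (24 * 60);
                int partition = (int) (minuteOfDay / 6) % WINDOWS_PER_DAY;

                // Assumes a single topic ("events") created with 240 partitions.
                producer.send(new ProducerRecord<>("events", partition,
                        Long.toString(nowMillis), tagged));
            }
        }
    }

With this layout the 240 topics per day collapse into one topic, and the "which window is this?" question moves into the record itself and its partition number.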
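
[Editor's note] On the consuming side, Chen's "read that specific partition at the end of each 6 minutes" could then become a consumer that assigns itself only the partition for the window that just closed. Again a sketch under the same assumptions (topic "events", 240 partitions) using the newer Java consumer client; the group id, the hard-coded window number, and the single poll are placeholders, and a real reader would poll in a loop until it has drained the window.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class WindowReader {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("group.id", "window-reader");         // placeholder group
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            // The 6-minute window that just closed; in practice computed from the
            // clock the same way the producer did. Hard-coded here for illustration.
            int closedWindow = 42;

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Read only the partition holding the closed window's data,
                // bypassing consumer-group rebalancing entirely.
                TopicPartition tp = new TopicPartition("events", closedWindow);
                consumer.assign(Collections.singletonList(tp));

                // A single poll for brevity; loop until caught up in real use.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Timestamp tag and payload were joined with a tab by the producer.
                    String[] parts = record.value().split("\t", 2);
                    System.out.println("ts=" + parts[0] + " payload=" + parts[1]);
                }
                consumer.commitSync();
            }
        }
    }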