Custom Offset Management

2016-09-09 Thread Daniel Fagnan
I’m currently wondering if it’s possible to use the internal 
`__consumer_offsets` topic to manage offsets outside the consumer group APIs. 
I’m using the low-level API to manage the consumers but I’d still like to store 
offsets in Kafka.
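
For what it’s worth, here’s a rough, untested sketch of what I’m hoping is 
possible, using the new Java consumer (0.9+) in place of the low-level API: 
assign partitions manually so no group rebalancing happens, but still commit 
explicit offsets, which land in __consumer_offsets. The group id, topic name, 
and partition below are placeholders.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class ManualAssignCommitSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "my-group");            // placeholder group id
            props.put("enable.auto.commit", "false");     // commit explicitly below
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // assign() instead of subscribe(): no rebalancing, we own the
                // partition-to-consumer mapping ourselves.
                TopicPartition tp = new TopicPartition("events", 0);  // placeholder topic
                consumer.assign(Collections.singletonList(tp));

                // Resume from whatever this group last committed, if anything.
                OffsetAndMetadata last = consumer.committed(tp);
                consumer.seek(tp, last == null ? 0L : last.offset());

                ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // ... process record ...
                }

                // Explicit commit; this is what ends up in __consumer_offsets.
                consumer.commitSync(Collections.singletonMap(tp,
                    new OffsetAndMetadata(consumer.position(tp))));
            }
        }
    }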

If it’s not possible to publish and fetch offsets from the internal topic, 
would a separate compacted log replicate most of the functionality?
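
If publishing to the internal topic isn’t workable, here’s a rough sketch of 
the compacted-log alternative I have in mind, assuming a hypothetical topic 
named offset-store created with cleanup.policy=compact and keyed by 
group/topic/partition, so compaction keeps only the latest committed offset 
per key:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class OffsetStoreSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("acks", "all");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            // "offset-store" is a hypothetical compacted topic;
            // key = group:topic:partition, value = next offset to read.
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("offset-store",
                    "my-group:events:0", Long.toString(42L)));
            }
        }
    }

On startup a consumer would read offset-store from the beginning to rebuild 
the latest offset per key, which (as far as I understand) is roughly what the 
broker does internally with __consumer_offsets.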

Thanks,
Daniel



Re: Large # of Topics/Partitions

2016-08-08 Thread Daniel Fagnan
Thanks Tom! This was very helpful and I’ll explore having a more static set of 
partitions as that seems to fit Kafka a lot better.

Cheers,
Daniel

> On Aug 8, 2016, at 12:27 PM, Tom Crayford <tcrayf...@heroku.com> wrote:
> 
> Hi Daniel,
> 
> Kafka doesn't provide this kind of isolation or scalability for many,
> many streams. The usual design is to use a consistent hash of some "key"
> to assign your data to a particular partition. That, of course, doesn't
> isolate things fully: everything that lands in the same partition is
> dependent on everything else in it. [A sketch of this keyed approach
> appears at the end of this message.]
> 
> We've found that clusters hit a lot of issues somewhere between a few
> thousand and a few tens of thousands of partitions (it depends on the
> write pattern, how much memory you give the brokers and ZooKeeper, and
> whether you plan on ever deleting topics).
> 
> Another option is to manage multiple clusters, and keep under a certain
> limit of partitions in each cluster. That is of course additional
> operational overhead and complexity.
> 
> I'm not sure I 100% understand your mechanism for tracking pending offsets,
> but it seems like that might be your best option.
> 
> Thanks
> 
> Tom Crayford
> Heroku Kafka
> 
> On Mon, Aug 8, 2016 at 8:12 PM, Daniel Fagnan <dan...@segment.com> wrote:
> 
>> Hey all,
>> 
>> I’m currently in the process of designing a system around Kafka and I’m
>> wondering about the recommended way to manage topics. Each event stream
>> we have needs to be isolated from the others. A failure in one should not
>> keep another event stream from processing (by failure, we mean a
>> downstream failure that would require us to replay the messages).
>> 
>> So my first thought was to create a topic per event stream. This allows a
>> larger event stream to be partitioned for added parallelism while keeping
>> the default # of partitions as low as possible. This would satisfy the
>> isolation requirement in that a topic can keep failing and we’ll continue
>> replaying its messages without affecting all the other topics.
>> 
>> We’ve read that it’s not recommended to have your data model dictate the
>> # of partitions or topics in Kafka, and we’re unsure about this approach
>> if we need to triple the number of event streams.
>> 
>> We’re currently looking at 10,000 event streams (or topics) but we don’t
>> want to be spinning up additional brokers just so we can add more event
>> streams, especially if the load for each is reasonable.
>> 
>> Another option we were looking into was to not isolate at the
>> topic/partition level but to keep a set of pending offsets persisted
>> somewhere (similar to what Twitter Heron or Storm does, except they don’t
>> seem to persist the pending offsets).
>> 
>> Thoughts?
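
A rough sketch of the keyed, fixed-partition design described above, assuming 
the standard Java producer: all streams share one topic (or a small, fixed set 
of topics), and each record carries its stream id as the key, so the default 
partitioner (a murmur2 hash of the key modulo the partition count) keeps every 
event for a given stream in order on a single partition. The topic name and 
stream id are placeholders.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedStreamSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // The stream id is the record key, so every event for "stream-1234"
                // lands on the same partition of the shared "events" topic.
                producer.send(new ProducerRecord<>("events", "stream-1234",
                    "payload..."));
            }
        }
    }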


