James,

You can reference Confluent's Schema Registry implementation:
http://docs.confluent.io/1.0/schema-registry/docs/index.html
It does a similar thing to what you described: a REST front end that serves data from a compacted topic. HA is also provided in the solution.

On Tue, 21 Jul 2015 at 09:25 James Cheng <jch...@tivo.com> wrote:

> Hi,
>
> I have a web service that serves up some data that it obtains from a kafka
> topic. When the process starts up, it wants to load the entire kafka topic
> into memory, and serve the data up from an in-memory hashtable. The data in
> the topic has primary keys and is log compacted, and so the total dataset
> will be small enough to fit in memory. My web service will only start
> serving up data when the entire topic is loaded. (And for that,
> https://issues.apache.org/jira/browse/KAFKA-1977 would be super useful.)
>
> I am only storing this data in memory. In the event of process death or
> restart, my in-memory state is gone, and so I will always want to rebuild
> it by again consuming the topic from the earliest offset. I will never need
> to checkpoint my offsets.
>
> Also, I will have N instances of this application, each one needing to
> consume the entire topic. This is how I plan to do horizontal scaling of my
> web service.
>
> I would like to use the high-level consumer, so that I don't need to
> manually discover which broker is the leader, and so that I don't have to
> handle leader rebalancing.
>
> A couple of questions:
>
> 1) Does this use case make sense? Is this pattern used by anyone else? I
> like it because it makes my web service completely stateless.
>
> 2) In order to make each instance consume all partitions of the topic, I
> need each consumer group id to be unique to that process. So I was thinking
> of just using a UUID or something similar. What is the "cost" of creating a
> new consumer group id? If I am creating a new one every time I start my
> application, would I be cluttering up ZooKeeper or the __consumer_offsets
> topic? Note there will only ever be N instances of my application running.
> Since I will never need to checkpoint my offsets, does that affect my
> question about "cluttering up" ZooKeeper/Kafka? Are old consumer groups
> ever cleaned out of ZooKeeper or the __consumer_offsets topic?
>
> 3) Are the stored offsets used for any other reason, aside from at startup
> of a new consumer? Are offsets used after rebalancing when partition
> leaders change due to broker failure? I know that offsets can be used for
> Burrow-like monitoring.
>
> 4) Since I don't need support for checkpointing, another option is to use
> the SimpleConsumer. The sample code at
> https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
> looks fairly comprehensive. It handles discovery of the partition leader,
> and handles leader rebalancing. Are there any other situations that I
> should be aware of before relying on that sample code?
>
> 5) Will any of this change when the new consumer comes out? Will the
> SimpleConsumer still exist when the new consumer comes out?
>
> Thanks,
> -James
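The replay pattern described in the quoted message (rebuild an in-memory table by re-consuming a compacted topic from the earliest offset) can be sketched as follows. This is a minimal illustration, not Kafka client code: the `replay` helper and the record shape are made up for this sketch, and a real implementation would keep polling the consumer until it reached the end of the topic before serving requests.

```python
from typing import Dict, Iterable, Optional, Tuple

def replay(records: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Fold a stream of (key, value) records into an in-memory table.

    Mirrors log-compaction semantics: the latest value for each key
    wins, and a None value (a tombstone) deletes the key.
    """
    table: Dict[str, str] = {}
    for key, value in records:
        if value is None:
            table.pop(key, None)  # tombstone: drop the key
        else:
            table[key] = value    # latest write for the key wins
    return table

# Example stream: two writes to "a", then a tombstone for "b".
stream = [("a", "1"), ("b", "2"), ("a", "3"), ("b", None)]
print(replay(stream))  # {'a': '3'}
```

Because the topic is compacted, replaying it from the earliest offset always converges on the same table a reader would get from the full history, which is what makes the stateless-restart design workable.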
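On question 2, generating a throwaway group id per process is straightforward. A sketch (the `fresh_group_id` helper and the "web-cache" prefix are invented here; in a real consumer the resulting string would be supplied as the group id, with the offset-reset setting pointed at the earliest position so a brand-new group replays the whole topic):

```python
import uuid

def fresh_group_id(prefix: str = "web-cache") -> str:
    """Generate a consumer group id unique to this process instance.

    Because the group id is new on every start, the consumer finds no
    previously committed offsets and, with the offset reset policy set
    to the earliest position, consumes the topic from the beginning.
    """
    return f"{prefix}-{uuid.uuid4()}"

print(fresh_group_id())  # e.g. "web-cache-<random uuid>", unique per call
```

A readable prefix keeps these one-shot groups easy to spot (and clean up) when inspecting ZooKeeper or the __consumer_offsets topic.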