I think this is a good point. It might be good to just take the .NET
patch now if it works, but I don't think that is a great long-term strategy.
The consumer is too complicated right now to maintain simultaneously in
every language.

Let me give a couple of options I don't think are great:

1. Using Thrift or protocol buffers doesn't really solve the problem. The
problem we have isn't our serialized wire format; it is the complex code
needed in the consumer to give a simplified API to the user. Even if we used
protocol buffers it wouldn't really help thin the client; it would just
replace our request/response objects with new generated ones and add a new
runtime dependency for users. We did this for Voldemort by adding optional
protocol buffer definitions for all the requests, and in the end no one used
the protocol buffers because a simple wire format was just as easy to use and
protocol buffers supported so few languages. Changing our request/response
objects might be a good idea for other reasons; it just doesn't help with
this.

2. Trying to adopt an off-the-shelf protocol. We really do want to give good
performance and reasonable distributed semantics. I looked at a few of these
generic protocols and I think they essentially imply a particular
implementation. Actually, I think they imply the union of the implementations
of all the systems involved in the standardization process :-) Worse, I think
they don't really solve the harder problem of balancing consumer load.

One option that would be okay would be creating a simple RESTful proxy that
consumed into a buffer, handed out the messages in the buffer on request,
and committed the offset at the same time. This is inefficient and would not
allow semantic partitioning, but it would make the simple case simple and is
easy to implement as a standalone module.
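To make that concrete, here is a rough sketch of the proxy's core loop. The
names (BufferingProxy, fetch, commit, poll) are all made up for illustration;
fetch and commit stand in for whatever real consumer calls the proxy would
wrap:

```python
from collections import deque

class BufferingProxy:
    """Consumes into a buffer; each poll hands out buffered messages and
    commits the offset for everything handed out in the same step."""

    def __init__(self, fetch, commit):
        self.fetch = fetch      # fetch(offset) -> list of (offset, message)
        self.commit = commit    # commit(offset) records consumer progress
        self.offset = 0
        self.buffer = deque()

    def fill(self):
        # Pull the next batch from the broker into the local buffer.
        for off, msg in self.fetch(self.offset):
            self.buffer.append((off, msg))
            self.offset = off + 1

    def poll(self, max_messages):
        # Hand out up to max_messages and commit past them in one step, so a
        # message handed out is never re-delivered (at-most-once semantics).
        if not self.buffer:
            self.fill()
        n = min(max_messages, len(self.buffer))
        batch = [self.buffer.popleft() for _ in range(n)]
        if batch:
            self.commit(batch[-1][0] + 1)
        return [msg for _, msg in batch]
```

A REST layer would just map a GET onto poll() and return the messages;
committing inside poll() is what makes the client side trivially thin.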

For a slightly more long-term solution, here is my thought:

The fat scala/jvm client does a few hard things: (1) co-ordination,
(2) multithreading, including a non-blocking fetch with a queue in the
middle to buffer consumption, and (3) handling many topics and/or
multithreaded consumption from the same socket and fetcher pool.

Here is how we could remove the co-ordination zookeeper code from the
client. What if we just made a PartitionAssignmentRequest and moved that
whole chunk of logic about who gets what to the server? What if instead of
the clients co-ordinating to choose consumers, the clients just registered
themselves in zookeeper, watched the other consumers and brokers, and
responded to any state change by re-requesting partition assignments from
the servers? Actually you might be able to simplify this further by just
having the consumers register, and having the server disconnect them
whenever they need partition assignments (at which point they would respond
by getting new assignments and connecting to those). The brokers would pick
a master, which would put its node id in zk, and consumers would use this to
make the request for partitions. I think centralizing the logic might make
fancier locality-aware partition assignments easier to implement and debug
too. Now the only requirement on the client is that it be able to register
in zookeeper, which should be pretty easy in most languages (C, python,
etc).
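Roughly, this is the shape I have in mind. Everything here is hypothetical
(Registry is an in-memory stand-in for the zookeeper registration path, and
a simple round-robin policy stands in for whatever assignment logic the
broker master would actually implement), but it shows how all the
co-ordination logic ends up in one server-side function:

```python
def assign_partitions(consumers, partitions):
    # Server-side policy: round-robin over a sorted member list. A real
    # implementation could be range-based or locality-aware; the point is
    # it lives in one place instead of being computed on every client.
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

class Registry:
    """In-memory stand-in for zookeeper registration + watches."""

    def __init__(self):
        self.members = set()
        self.watchers = []

    def register(self, consumer_id, on_change):
        self.members.add(consumer_id)
        self.watchers.append(on_change)
        for watch in self.watchers:  # membership change fires every watch
            watch()

class Consumer:
    def __init__(self, consumer_id, registry, request_assignment):
        self.id = consumer_id
        self.assigned = []
        self.request_assignment = request_assignment
        registry.register(self.id, self.refresh)

    def refresh(self):
        # On any state change, just re-request partitions from the master;
        # the client carries no co-ordination logic of its own.
        self.assigned = self.request_assignment(self.id)
```

When a second consumer joins, every consumer's watch fires and each simply
asks the master for its new assignment; the partitions stay disjoint and
fully covered without any client-side election or rebalance code.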

For the next two "hard" pieces I think the answer is easy: just don't do
them. I think our current consumer is very good, and a good fit for the
thread-centric model of the jvm, but for non-jvm languages just
single-threading the consumer is preferable. I think a better api for these
cases would be a non-blocking select on all the open broker connections,
done inline as part of the iterator (e.g. when you are computing next(), if
you are out of data, just check your socket buffers). Since the select is
non-blocking, doing this inline should not be an issue at all. I think the
best approach would be to implement this in C and then just wrap it for
other languages, but doing it in python or ruby or whatever would likely be
a pretty small amount of code too. We would not try to share a single
connection per node; we would just have one per topic-iterator, which is not
ideal but probably fine and greatly simplifies everything.

Cheers,

-Jay

On Wed, Sep 7, 2011 at 9:21 AM, Jun Rao <[email protected]> wrote:

> KAFKA-85 raised a good question: what's the right approach to support
> client
> bindings for languages other than java? I don't have a perfect answer and
> would like to start a discussion and let everybody weigh in.
>
> The approach that KAFKA-85 took is to re-write all the logic in our fat
> client (both the producer and the consumer) in C#. This means that a lot of
> code has to be re-written and maintained and it's a lot of work if every
> language does the same thing.
>
> There are 2 other approaches that some Apache projects have used to support
> different language bindings. The first one is to use an RPC code generator
> to directly expose the api to other languages. For example, Cassandra uses
> Thrift to define the client API and lets Thrift generate language-specific
> client code to talk to the server. This approach works well for thin
> clients. In Cassandra, the client only does serializing/deserializing of
> requests/responses and the complex routing logic is on the server. This
> approach may not work well with Kafka since our client is relatively fat
> (lots of code in handling both the produce and the consume request in the
> client library).
>
> The second approach is to have a gateway. For example, HBase also has a
> relatively fat java client. To support other languages, it exposes its api
> indirectly in a java gateway. The gateway api is compiled into different
> languages using Thrift. The generated gateway client code is thin and all
> the complicated routing logic is in the gateway itself. The downside is
> that
> this adds the complexity of maintaining the gateway and adds one extra hop
> between the client and the server. Setting these 2 concerns aside, this
> approach probably works well with Kafka producers. However, it's not clear
> how this works with the consumers since they get data continuously.
>
> Does anyone know how other queuing systems (activemq, rabbitmq, etc)
> support
> non-java clients?
>
> Thanks,
>
> Jun
>
