Re: [VOTE] KIP-67: Queryable state for Kafka Streams

Ismael Juma Tue, 12 Jul 2016 14:12:26 -0700

Hi Damian,

How about StreamsMetadata instead? The general naming pattern seems to
avoid the `Kafka` prefix for everything outside of `KafkaStreams` itself.


Ismael

On Tue, Jul 12, 2016 at 7:14 PM, Damian Guy <[email protected]> wrote:

> Hi,
>
> I agree with point 1. application.server is a better name for the config
> (we'll change this). However, on point 2 I think we should stick mostly
> with what we already have. I've tried both ways of doing this when working
> on the JIRA and building examples and I find the current approach more
> intuitive and easier to use than the Map based approach.
> However, there is probably a naming issue. We should rename
> KafkaStreamsInstance to KafkaStreamsMetadata. This class is very simple,
> but provides all the information a developer needs to find the
> instance(s) of a Streams application that a particular store is running on,
> i.e.,
>
> public class KafkaStreamsMetadata {
>     private final HostInfo hostInfo;
>     private final Set<String> stateStoreNames;
>     private final Set<TopicPartition> topicPartitions;
>     ...
> }
>
>
> So using the API to route to a new host is fairly simple, particularly in
> the case when you want to find the host for a particular key, i.e.,
>
> final KafkaStreams kafkaStreams = createKafkaStreams();
> final KafkaStreamsMetadata streamsMetadata =
>     kafkaStreams.instanceWithKey("word-count", "hello",
>         Serdes.String().serializer());
> http.get("http://" + streamsMetadata.host() + ":" +
>     streamsMetadata.port() + "/get/word-count/hello");
>
>
> And if you want to do a scatter gather approach:
>
> final KafkaStreams kafkaStreams = createKafkaStreams();
> final Collection<KafkaStreamsMetadata> kafkaStreamsMetadatas =
>     kafkaStreams.allInstancesWithStore("word-count");
> for (KafkaStreamsMetadata streamsMetadata : kafkaStreamsMetadatas) {
>     http.get("http://" + streamsMetadata.host() + ":" +
>         streamsMetadata.port() + "/get/word-count/hello");
>     ...
> }
>
>
> And if you iterated over all instances:
>
> final KafkaStreams kafkaStreams = createKafkaStreams();
> final Collection<KafkaStreamsMetadata> kafkaStreamsMetadatas =
>     kafkaStreams.allInstances();
> for (KafkaStreamsMetadata streamsMetadata : kafkaStreamsMetadatas) {
>     if (streamsMetadata.stateStoreNames().contains("word-count")) {
>         http.get("http://" + streamsMetadata.host() + ":" +
>             streamsMetadata.port() + "/get/word-count/hello");
>         ...
>     }
> }
>
>
> If we were to change this to use Map<HostInfo, Set<TaskMetadata>> for the
> most part users would need to iterate over the entry or key set. Examples:
>
> Finding an instance by key is a little odd:
>
> final KafkaStreams kafkaStreams = createKafkaStreams();
> final Map<HostInfo, Set<TaskMetadata>> streamsMetadata =
>     kafkaStreams.instanceWithKey("word-count", "hello",
>         Serdes.String().serializer());
> // this is a bit odd as I only expect one:
> for (HostInfo hostInfo : streamsMetadata.keySet()) {
>     http.get("http://" + hostInfo.host() + ":" +
>         hostInfo.port() + "/get/word-count/hello");
> }
>
>
> The scatter/gather by store is fairly similar to the previous example:
>
> final KafkaStreams kafkaStreams = createKafkaStreams();
> final Map<HostInfo, Set<TaskMetadata>> streamsMetadata =
>     kafkaStreams.allInstancesWithStore("word-count");
> for (HostInfo hostInfo : streamsMetadata.keySet()) {
>     http.get("http://" + hostInfo.host() + ":" + hostInfo.port() +
>         "/get/word-count/hello");
>     ...
> }
>
> And iterating over all instances:
>
> final Map<HostInfo, Set<TaskMetadata>> streamsMetadata =
>     kafkaStreams.allInstances();
> for (Map.Entry<HostInfo, Set<TaskMetadata>> entry :
>         streamsMetadata.entrySet()) {
>     final HostInfo hostInfo = entry.getKey();
>     for (TaskMetadata taskMetadata : entry.getValue()) {
>         if (taskMetadata.stateStoreNames().contains("word-count")) {
>             http.get("http://" + hostInfo.host() + ":" +
>                 hostInfo.port() + "/get/word-count/hello");
>             ...
>         }
>     }
> }
>
>
> IMO - having a class we return is the better approach as it nicely wraps
> the related things, i.e., host:port, store names, topic partitions, into an
> Object that is easy to use. Further we could add some behaviour to this
> class if we felt it necessary, e.g., hasStore(storeName) etc.
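
A minimal sketch of the wrapper-class idea described above, for illustration only. The class name StreamsMetadataSketch, the queryUrl helper, and the "/get/<store>/<key>" path layout are assumptions taken from the examples in this thread, not the final Kafka Streams API. Topic partitions are left out so the sketch compiles without the Kafka jars.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the metadata class discussed in this thread.
// Names and fields are illustrative, not the final Kafka Streams API.
final class StreamsMetadataSketch {
    private final String host;
    private final int port;
    private final Set<String> stateStoreNames;

    StreamsMetadataSketch(String host, int port, Set<String> stateStoreNames) {
        this.host = host;
        this.port = port;
        // Defensive copy so callers cannot mutate the metadata afterwards.
        this.stateStoreNames =
            Collections.unmodifiableSet(new HashSet<>(stateStoreNames));
    }

    String host() { return host; }

    int port() { return port; }

    Set<String> stateStoreNames() { return stateStoreNames; }

    // The kind of convenience behaviour mentioned above.
    boolean hasStore(String storeName) {
        return stateStoreNames.contains(storeName);
    }

    // Builds the URL used in the examples; the path layout is an assumption.
    String queryUrl(String storeName, String key) {
        return "http://" + host + ":" + port + "/get/" + storeName + "/" + key;
    }
}
```

With something like this, the "iterate over all instances" example reduces to a hasStore("word-count") check followed by a request to queryUrl("word-count", "hello").
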
>
> Anyway, i'm interested in your thoughts.
>
> Thanks,
> Damian
>
> On Mon, 11 Jul 2016 at 13:47 Guozhang Wang <[email protected]> wrote:
>
> > 1. Re StreamsConfig.USER_ENDPOINT_CONFIG:
> >
> > I agree with Neha that Kafka Streams can provide the bare minimum APIs
> > just for host/port, and the user-implemented layer can provide whatever
> > URL / proxy address they want to build on top of it.
> >
> >
> > 2. Re Improving KafkaStreamsInstance interface:
> >
> > Users are indeed aware of "TaskId" class which is not part of internal
> > packages and is exposed in PartitionGrouper interface that can be
> > instantiated by the users, which is assigned with input topic partitions.
> > So we can probably change the APIs as:
> >
> > Map<HostState, Set<TaskMetadata>> KafkaStreams.getAllTasks() where
> > TaskMetadata has fields such as taskId, list of assigned partitions, list
> > of state store names; and HostState can include hostname / port. The port
> > is the listening port of a user-defined listener that users provide to
> > listen for queries (e.g., using REST APIs).
> >
> > Map<HostState, Set<TaskMetadata>> KafkaStreams.getTasksWithStore(String
> > /* storeName */) would return only the hosts and their assigned tasks if
> > at least one of the tasks includes the given store name.
> >
> > Map<HostState, Set<TaskMetadata>> KafkaStreams.getTaskWithStoreAndKey(Key
> > k, String /* storeName */, StreamPartitioner partitioner) would return
> > only the host and its assigned task if the store with the given name
> > holds a particular key, according to the partitioner behavior.
> >
> >
> >
> > Guozhang
> >
> >
> > On Sun, Jul 10, 2016 at 11:21 AM, Neha Narkhede <[email protected]>
> wrote:
> >
> > > Few thoughts that became apparent after observing example code of what
> > > an application architecture and code might look like with these changes.
> > > Apologies for the late realization.
> > >
> > > 1. "user.endpoint" will be very differently defined for respective
> > > applications. I don't think Kafka Streams should generalize to accept
> any
> > > connection URL as we expect to only expose metadata expressed as
> HostInfo
> > > (which is defined by host & port) and hence need to interpret the
> > > "user.endpoint" as host & port. Applications will have their own
> endpoint
> > > configs that will take many forms and they will be responsible for
> > parsing
> > > out host and port and configuring Kafka Streams accordingly.
> > >
> > > If we are in fact limiting to host and port, I wonder if we should
> change
> > > the name of "user.endpoint" into something more specific. We have
> clients
> > > expose host/port pairs as "bootstrap.servers". Should this be
> > > "application.server"?
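
A rough sketch of the parsing step described above, in which the application derives a plain host:port value from its own endpoint configuration before handing it to Kafka Streams. The HostPort class and its methods are hypothetical helpers written for this illustration; they are not part of Kafka Streams.

```java
import java.net.URI;

// Hypothetical helper, not part of Kafka Streams: holds a validated host:port.
final class HostPort {
    final String host;
    final int port;

    private HostPort(String host, int port) {
        if (host == null || host.isEmpty()) {
            throw new IllegalArgumentException("missing host");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        this.host = host;
        this.port = port;
    }

    // Parses a strict "host:port" value; anything else is rejected.
    static HostPort parse(String value) {
        int idx = value.lastIndexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("expected host:port but got: " + value);
        }
        try {
            int port = Integer.parseInt(value.substring(idx + 1));
            return new HostPort(value.substring(0, idx), port);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("expected host:port but got: " + value, e);
        }
    }

    // Extracts host and port from an application-specific URL.
    static HostPort fromUrl(String url) {
        URI uri = URI.create(url);
        // getPort() returns -1 when the URL has no explicit port,
        // which the constructor rejects.
        return new HostPort(uri.getHost(), uri.getPort());
    }

    // The value an application would pass on as its host:port setting.
    String asConfigValue() {
        return host + ":" + port;
    }
}
```

An application whose own endpoint setting is a full URL could call fromUrl(...).asConfigValue() and pass the result on as the host:port setting.
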
> > >
> > > 2. I don't think we should expose another abstraction called
> > > KafkaStreamsInstance to the user. This is related to the discussion of
> > the
> > > right abstraction that we want to expose to an application. The
> > abstraction
> > > discussion itself should probably be part of the KIP itself, let me
> give
> > a
> > > quick summary of my thoughts here:
> > > 1. The person implementing an application using Queryable State has
> > likely
> > > already made some choices for the service layer–a REST framework,
> Thrift,
> > > or whatever. We don't really want to add another RPC framework to this
> > mix,
> > > nor do we want to try to make Kafka's RPC mechanism general purpose.
> > > 2. Likewise, it should be clear that the API you want to expose to the
> > > front-end/client service is not necessarily the API you'd need
> internally
> > > as there may be additional filtering/processing in the router.
> > >
> > > Given these constraints, what we prefer to add is a fairly low-level
> > > "toolbox" that would let you do anything you want, but requires you to
> > > route and perform any aggregation or processing yourself. This pattern
> > > is not recommended for all kinds of services/apps, but there is
> > > definitely a category of things where it is a big win, and other
> > > advanced applications are out of scope.
> > >
> > > The APIs we expose should take the following things into consideration:
> > > 1. Make it clear to the user that they will do the routing,
> aggregation,
> > > processing themselves. So the bare minimum that we want to expose is
> > store
> > > and partition metadata per application server identified by the host
> and
> > > port.
> > > 2. Ensure that the API exposes abstractions that are known to the user
> or
> > > are intuitive to the user.
> > > 3. Avoid exposing internal objects or implementation details to the
> user.
> > >
> > > So tying all this into answering the question of what we should expose
> > > through the APIs -
> > >
> > > In Kafka Streams, the user is aware of the concept of tasks and
> > partitions
> > > since the application scales with the number of partitions and tasks
> are
> > > the construct for logical parallelism. The user is also aware of the
> > > concept of state stores though until now they were not user accessible.
> > > With Queryable State, the bare minimum abstractions that we need to
> > expose
> > > are state stores and the location of state store partitions.
> > >
> > > For exposing the state stores, the getStore() APIs look good but I
> think
> > > for locating the state store partitions, we should go back to the
> > original
> > > proposal of simply exposing some sort of getPartitionMetadata() that
> > > returns a PartitionMetadata or TaskMetadata object keyed by HostInfo.
> > >
> > > The application will convert the HostInfo (host and port) into some
> > > connection URL to talk to the other app instances via its own RPC
> > mechanism
> > > depending on whether it needs to scatter-gather or just query. The
> > > application will know how a key maps to a partition and through
> > > PartitionMetadata it will know how to locate the server that hosts the
> > > store that has the partition hosting that key.
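
The lookup described above (key to partition, then partition to host) can be sketched roughly as follows. Everything here is an assumption made for illustration: the PartitionLocator class is hypothetical, hosts are plain "host:port" strings, and a simple array hash stands in for whatever partitioner the application actually uses. A real application must apply the same partitioning logic that was used when the data was written.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Kafka Streams.
final class PartitionLocator {

    // Assumption: a plain array hash stands in for the real partitioner.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // Mask the sign bit so the result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    // Finds the host whose partition set contains the key's partition.
    static Optional<String> hostFor(String key,
                                    int numPartitions,
                                    Map<String, Set<Integer>> partitionsByHost) {
        int partition = partitionFor(key, numPartitions);
        for (Map.Entry<String, Set<Integer>> entry : partitionsByHost.entrySet()) {
            if (entry.getValue().contains(partition)) {
                return Optional.of(entry.getKey());
            }
        }
        // No instance currently reports that partition.
        return Optional.empty();
    }
}
```

The empty result matters in practice: after a rebalance the metadata may briefly not list any host for a partition, so callers should be prepared to retry.
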
> > >
> > > On Fri, Jul 8, 2016 at 9:40 AM, Michael Noll <[email protected]>
> > wrote:
> > >
> > > > Addendum in case my previous email wasn't clear:
> > > >
> > > > > So for any given instance of a streams application there will never
> > be
> > > > both a v1 and v2 alive at the same time
> > > >
> > > > That's right.  But the current live instance will be able to tell
> > > > other instances, via its endpoint setting, whether it wants to be
> > > > contacted at v1 or at v2.  The other instances can't guess that.
> > > > (Think: if an older instance manually composed the "rest" of an
> > > > endpoint URI, having only the host and port from the endpoint
> > > > setting, it might not know that the new instances have a different
> > > > endpoint suffix, for example.)
> > > >
> > > >
> > > > On Fri, Jul 8, 2016 at 6:37 PM, Michael Noll <[email protected]>
> > > wrote:
> > > >
> > > > > Damian,
> > > > >
> > > > > about the rolling upgrade comment:  An instance A will contact
> > another
> > > > > instance B by the latter's endpoint, right?  So if A has no further
> > > > > information available than B's host and port, then how should
> > instance
> > > A
> > > > > know whether it should call B at /v1/ or at /v2/?  I agree that my
> > > > > suggestion isn't foolproof, but it is afaict better than the
> > host:port
> > > > > approach.
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jul 8, 2016 at 5:15 PM, Damian Guy <[email protected]>
> > > wrote:
> > > > >
> > > > >> Michael - i'm ok with changing it to a string. Any one else have a
> > > > strong
> > > > >> opinion on this?
> > > > >>
> > > > >> FWIW - i don't think it will work fine as is during the rolling
> > > upgrade
> > > > >> scenario as the service that is listening on the port needs to be
> > > > embedded
> > > > >> within each instance. So for any given instance of a streams
> > > application
> > > > >> there will never be both a v1 and v2 alive at the same time
> (unless
> > of
> > > > >> course the process didn't shutdown properly, but then you have
> > another
> > > > >> problem...).
> > > > >>
> > > > >> On Fri, 8 Jul 2016 at 15:26 Michael Noll <[email protected]>
> > > wrote:
> > > > >>
> > > > >> > I have one further comment about
> > > `StreamsConfig.USER_ENDPOINT_CONFIG`.
> > > > >> >
> > > > > >> > I think we should consider not restricting the value of this
> > > > setting
> > > > >> to
> > > > >> > only `host:port` pairs.  By design, this setting is capturing
> > > > >> user-driven
> > > > >> > metadata to define an endpoint, so why restrict the creativity
> or
> > > > >> > flexibility of our users?  I can imagine, for example, that
> users
> > > > would
> > > > >> > like to set values such as `https://host:port/api/rest/v1/` in
> > this
> > > > >> field
> > > > >> > (e.g. being able to distinguish between `.../v1/` and `.../v2/`
> > may
> > > > >> help in
> > > > >> > scenarios such as rolling upgrades, where, during the upgrade,
> > older
> > > > >> > instances may need to coexist with newer instances).
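
To make the trade-off above concrete, here is a small sketch comparing the two ways a caller could build a request URL. The EndpointUrls class and its method names are hypothetical and exist only for this illustration; the only library call involved is java.net.URI.resolve.

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
final class EndpointUrls {

    // With only host:port, the caller has to hard-code the scheme and any
    // version prefix itself (the versioned path here is the caller's guess).
    static URI fromHostPort(String host, int port, String pathWithVersion) {
        return URI.create("http://" + host + ":" + port + "/").resolve(pathWithVersion);
    }

    // With a full endpoint string, the remote instance advertises the scheme
    // and version prefix; the caller only appends a relative path.
    // Note: relativePath must not start with "/", otherwise resolve()
    // replaces the advertised path instead of extending it.
    static URI fromEndpoint(String endpoint, String relativePath) {
        String base = endpoint.endsWith("/") ? endpoint : endpoint + "/";
        return URI.create(base).resolve(relativePath);
    }
}
```

In the first form, an older instance that hard-codes "v1/..." keeps calling v1 even after the remote side has moved to v2. In the second form the prefix travels with the endpoint value.
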
> > > > >> >
> > > > >> > That said, I don't have a strong opinion here.
> > > > >> >
> > > > >> > -Michael
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Jul 8, 2016 at 2:55 PM, Matthias J. Sax <
> > > > [email protected]>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > +1
> > > > >> > >
> > > > >> > > On 07/08/2016 11:03 AM, Eno Thereska wrote:
> > > > >> > > > +1 (non-binding)
> > > > >> > > >
> > > > >> > > >> On 7 Jul 2016, at 18:31, Sriram Subramanian <
> > [email protected]>
> > > > >> wrote:
> > > > >> > > >>
> > > > >> > > >> +1
> > > > >> > > >>
> > > > >> > > >> On Thu, Jul 7, 2016 at 9:53 AM, Henry Cai
> > > > >> <[email protected]
> > > > >> > >
> > > > >> > > >> wrote:
> > > > >> > > >>
> > > > >> > > >>> +1
> > > > >> > > >>>
> > > > >> > > >>> On Thu, Jul 7, 2016 at 6:48 AM, Michael Noll <
> > > > >> [email protected]>
> > > > >> > > wrote:
> > > > >> > > >>>
> > > > >> > > >>>> +1 (non-binding)
> > > > >> > > >>>>
> > > > >> > > >>>> On Thu, Jul 7, 2016 at 10:24 AM, Damian Guy <
> > > > >> [email protected]>
> > > > >> > > >>> wrote:
> > > > >> > > >>>>
> > > > >> > > >>>>> Thanks Henry - we've updated the KIP with an example and
> > the
> > > > new
> > > > >> > > config
> > > > >> > > >>>>> parameter required. FWIW the user doesn't register a
> > > listener,
> > > > >> they
> > > > >> > > >>>> provide
> > > > >> > > >>>>> a host:port in config. It is expected they will start a
> > > > service
> > > > >> > > running
> > > > >> > > >>>> on
> > > > >> > > >>>>> that host:port that they can use to connect to the
> running
> > > > >> > > KafkaStreams
> > > > >> > > >>>>> Instance.
> > > > >> > > >>>>>
> > > > >> > > >>>>> Thanks,
> > > > >> > > >>>>> Damian
> > > > >> > > >>>>>
> > > > >> > > >>>>> On Thu, 7 Jul 2016 at 06:06 Henry Cai
> > > > >> <[email protected]>
> > > > >> > > >>>> wrote:
> > > > >> > > >>>>>
> > > > >> > > >>>>>> It wasn't quite clear to me how the user program
> > interacts
> > > > with
> > > > >> > the
> > > > >> > > >>>>>> discovery API, especially on the user supplied listener
> > > part,
> > > > >> how
> > > > >> > > >>> does
> > > > >> > > >>>>> the
> > > > >> > > >>>>>> user program supply that listener to KafkaStreams and
> how
> > > > does
> > > > >> > > >>>>> KafkaStreams
> > > > >> > > >>>>>> know which port the user listener is running, maybe a
> > more
> > > > >> > complete
> > > > >> > > >>>>>> end-to-end example including the steps on registering
> the
> > > > user
> > > > >> > > >>> listener
> > > > >> > > >>>>> and
> > > > >> > > >>>>>> whether the user listener needs to be involved with
> task
> > > > >> > > >>> reassignment.
> > > > >> > > >>>>>>
> > > > >> > > >>>>>>
> > > > >> > > >>>>>> On Wed, Jul 6, 2016 at 9:13 PM, Guozhang Wang <
> > > > >> [email protected]
> > > > >> > >
> > > > >> > > >>>>> wrote:
> > > > >> > > >>>>>>
> > > > >> > > >>>>>>> ＋1
> > > > >> > > >>>>>>>
> > > > >> > > >>>>>>> On Wed, Jul 6, 2016 at 12:44 PM, Damian Guy <
> > > > >> > [email protected]>
> > > > >> > > >>>>>> wrote:
> > > > >> > > >>>>>>>
> > > > >> > > >>>>>>>> Hi all,
> > > > >> > > >>>>>>>>
> > > > >> > > >>>>>>>> I'd like to initiate the voting process for KIP-67
> > > > >> > > >>>>>>>> <
> > > > >> > > >>>>>>>>
> > > > >> > > >>>>>>>
> > > > >> > > >>>>>>
> > > > >> > > >>>>>
> > > > >> > > >>>>
> > > > >> > > >>>
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-67%3A+Queryable+state+for+Kafka+Streams
> > > > >> > > >>>>>>>>>
> > > > >> > > >>>>>>>>
> > > > > >> > > >>>>>>>> KAFKA-3909 <https://issues.apache.org/jira/browse/KAFKA-3909>
> > > > > >> > > >>>>>>>> is the top level JIRA for this effort.
> > > > >> > > >>>>>>>>
> > > > >> > > >>>>>>>> Initial PRs for Step 1 of the process are:
> > > > > >> > > >>>>>>>> Expose State Store Names <https://github.com/apache/kafka/pull/1526>
> > > > > >> > > >>>>>>>> and Query Local State Stores <https://github.com/apache/kafka/pull/1565>
> > > > >> > > >>>>>>>>
> > > > >> > > >>>>>>>> Thanks,
> > > > >> > > >>>>>>>> Damian
> > > > >> > > >>>>>>>>
> > > > >> > > >>>>>>>
> > > > >> > > >>>>>>>
> > > > >> > > >>>>>>>
> > > > >> > > >>>>>>> --
> > > > >> > > >>>>>>> -- Guozhang
> > > > >> > > >>>>>>>
> > > > >> > > >>>>>>
> > > > >> > > >>>>>
> > > > >> > > >>>>
> > > > >> > > >>>>
> > > > >> > > >>>>
> > > > >> > > >>>> --
> > > > >> > > >>>> Best regards,
> > > > >> > > >>>> Michael Noll
> > > > >> > > >>>>
> > > > >> > > >>>>
> > > > >> > > >>>>
> > > > > >> > > >>>> *Michael G. Noll | Product Manager | Confluent | +1 650.453.5860
> > > > > >> > > >>>> Download Apache Kafka and Confluent Platform: www.confluent.io/download
> > > > > >> > > >>>> <http://www.confluent.io/download>*
> > > > >> > > >>>>
> > > > >> > > >>>
> > > > >> > > >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Best regards,
> > > > >> > Michael Noll
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Michael Noll
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Michael Noll
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Neha
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>
