Re: [DISCUSS] Client Side Auto Topic Creation

Tommy Becker Wed, 29 Jun 2016 13:13:12 -0700

We currently run with auto topic creation enabled, largely to ensure that our topics all 
get created with the cluster defaults. My understanding is that this is the only to 
ensure this, since the defaults are not accessible to clients. We run a cluster per 
deployment, with the defaults are set by our administrators; they are not guaranteed to 
be the same everywhere. One use case we have in particular is to create topics with log 
compaction enabled. Currently, doing this from the client requires the use of AdminUtils, 
which in turn requires that you specify pretty much the entire config. I would be very 
much in favor of having to use an AdminClient to explicitly create topics if it was 
possible to only override specific settings (e.g. enable log compaction). I think any 
solution that requires developers to "think harder" about things like partition 
count would need to be accompanied by some actual guidance on how to determine such 
things. Actually, such guidance would be nice regardless.

I think Roger's suggestion of having named topic configurations for specific use cases is a great
one. Being able to make these decisions once and then have applications be able to simply create a
topic for "max-redundancy" or "high-parallelism" would be nice.

On 06/29/2016 02:17 PM, Roger Hoover wrote:

My comments go a bit beyond just topic creation but I'd like to see Kafka
make it easier for application developers to specify their requirements
declaratively in a single place. Today, for example, if your application
requires strong guarantees against data loss, you must set a mix of
topic-level configs (replication factor, min.in.sync.replicas, retention.ms)
and client configs (acks=all and
possibly max.in.flight.requests.per.connection if you care about
ordering). This can be complicated by organizational structure where you
have a different team (SREs) responsible for the cluster configs and
perhaps topic creation and application teams responsible for the client
settings. Let's say that you get all the settings right up front. How
would you know if they later were changed incorrectly? How do admins know
which topics are ok to add more partitions are which are not? How do
downstream applications know how much retention they can rely on for
re-processing in their upstream topics.

I think it's useful to consider the typical roles in an organization. Say
we have an SRE team responsible for overall cluster health, capacity, etc.
This team likely has elevated privileges and perhaps wants to
review/approve settings for new topics to make sure they're sane.

The application developer may not care about some of the details of topic
creation but does care in as much as they affect the application
correctness and SLAs. It's more than just number of partitions and
replication factor. The application may require
1) some of it's topics to be compacted to function correctly and
min.compaction.lag.ms (KIP-58) set correctly
2) retention.ms set correctly on some of it's topics to satisfy it's
failure/re-processing SLAs
3) partitioning of it's input topics to match it's expectations
4) the data format to match expectations

I realize that #3 and #4 are unrelated to topic creation but they're part
of a set of invariants that the application needs enforced and should fail
early if their requirements are not met. For example, with semantically
partitioned topics, the application may break if new partitions are added.
The issue is that there is no standard mechanism or convention to
communicate application requirements so that admins and application teams
can verify that they continue to be met over time.

Imagine for a second that Kafka allowed arbitrary tags to be associated to
topics. An application could now define a specification for it's
interaction with Kafka including topic names, min replication factors,
fault tolerance settings (replication factors, min.in.sync.replicas,
producer acks), compacted yes/no, topic retention settings, can add/remove
partitions, partition key, and data format. Some of these requirements map
onto topics configs and some (like acks=all) are producer settings and some
(like partition key and data format) could be organizational conventions
stored as tags (format:avro).

For organizations where only SREs/admins can create/modify topics, this
spec allows them to do their job while being sure they're not breaking the
application. The application can verify on startup that it's requirements
are satisfied and fail early if not. If the application has permissions to
create it's own topics then the spec is a declarative format for doing that
require and will not require the same topic creation boilerplate code to be
duplicated in every application.

If people like this approach, perhaps we could define a topic spec (if all
fields besides topic name are empty it use "cluster defaults"). Then the
AdminClient would have an idempotent create method that takes a spec and
verifies that the spec is already met, tries to create topics to meet the
spec, or fails saying it cannot be met. Perhaps the producer and consumer
APIs would only have a verify() method which checks if the spec is
satisfied.

Cheers,

Roger

On Wed, Jun 29, 2016 at 8:50 AM, Grant Henke
<ghe...@cloudera.com><mailto:ghe...@cloudera.com> wrote:

Thanks for the discussion, below are some thoughts and responses.

One of the problems that we currently have with

the clients is that we retry silently on unknown topics under the
expectation that they will eventually be created (automatically or not).
This makes it difficult to detect misconfiguration without looking for
warnings in the logs. This problem is compounded if the client isn't
authorized to the topic since then we don't actually know if the topic
exists or not and whether it is reasonable to keep retrying.

Yeah this is a problem thats difficult and opaque to the user. I think any
of the proposed solutions would help solve this issue. Since the create
would be done at the metadata request phase, instead of in the produce
response handling. And if the create fails, the user would receive a munch
more clear authorization error.

The current auto creation of topic by the broker appear to be the only

reason an unknown topic error is retriable
which leads to bugs (like

https://issues.apache.org/jira/browse/KAFKA-3727

) where the consumer hangs forever (or until woken up) and only debug
tracing shows what's going on.

I agree this is related, but should be solvable even with retriable
exceptions. I think UnknownTopicOrPartitionException needs to remain
generally retriable because it could occur due to outdated metadata and not
because a topic needs to be created. In the case of message production or
consumption it could be explicitly handled differently in the client.

Do we clearly define the expected behavior of subscribe and assign in the
case of a missing topic? I can see reasons to fail early (partition will
never exist, typo in topic name) and reasons to keep returning empty record
sets until the topic exists (consumer with a preconfigured list of topics
that may or may not exist). Though I think failing and insisting topics
exist is the most predictable. Especially since the Admin API will make
creating topics easier.

Usually in the pre-prod environments you don't really

care about the settings at all, and in prod you can pre-provision.

I like the recommendations, developer/ops experience and required exercises
to be fairly consistent between dev, qa, and prod. If you need to
pre-provision and think about the settings in prod. Its best to put some
effort into building that logic in dev or qa too. Otherwise you get ready
to deploy and everything changes and all your earlier testing is not as
relevant.

For what it's worth the use case for auto-creation isn't using a dynamic

set of topics, but rather letting apps flow through different
dev/staging/prod/integration_testing/unit_testing environments without
having the app configure appropriate replication/partitioning stuff in

each

environment and having complex logic to check if the topic is there.

The problem I have seen here is that the cluster default is global, at
least until we have some concept of namespaces and can configure defaults
for each. Since picking a good number of partitions varies based on volume,
use case, etc a default that works for most topics is a hard to find.

I feel like because app developers think they don't need to think about
topic creation, often they don't. And that leads to a mess where they don't
know how may partitions and what replication factor they have. Instead
migrating environments with a setup script that creates the needed topics
allows them to source control those setting and create predictable,
repeatable deployments.

I have also seen a lot of issues where users are confused about why a topic
is coming back or can't be deleted. This is often a result
of auto.create.topics.enable being defaulted to true. And they never expect
that a feature like that would exist, much less be the default.

On a side note, the best dynamic use case I could think of is MirrorMaker.
But the cluster defaults here don't really work since its they are not very
flexible. Pushing creation to the client would allow tools like MirrorMaker
to create topics that match the upstream cluster, or provide its own logic
for sizing downstream topics.

This raises an important point about how we handle defaults, which I don't

think we talked about. I do think it is really important that we allow a
way to create topics with the "cluster defaults". I know this is possible
for configs since if you omit them they inherit default values, but I

think

we should be able to do it with replication factor and partition count

too.

I think the Java API should expose this and maybe even encourage it.

We could make the create topic request num_partitions and
replication_factor fields optional and if unset use the cluster defaults.
This allows a user to opt into the cluster defaults at create time. I have
rarely seen good defaults set in my experience though, especially since the
default is 1 in both cases.

I kind of feel once you start adding AdminClient methods to the producer

and consumer it's not really clear where to stop--e.g. if I can create I
should be able to delete, list, etc.

I agree this gets weird and could lead to duplicate client code and
inconsistent behavior across clients. The one thing I don't like about
requiring a separate client is it maintains all its own connections and
metadata. Perhaps sometime down the road if we see a lot of mixed usage we
could break out the core cluster connection code into a KafkaConnection
class and instantiate clients with that. That way clients could share the
same KafkaConnection.

Thanks,
Grant

On Wed, Jun 29, 2016 at 9:29 AM, Jay Kreps
<j...@confluent.io><mailto:j...@confluent.io> wrote:

For what it's worth the use case for auto-creation isn't using a dynamic
set of topics, but rather letting apps flow through different
dev/staging/prod/integration_testing/unit_testing environments without
having the app configure appropriate replication/partitioning stuff in

each

environment and having complex logic to check if the topic is there.
Basically if you leave this up to individual apps you get kind of a mess,
it's better to have cluster defaults that are reasonable and controlled

an admin and then pre-provision anything that is weird (super big,

unusual

perms, whatever). Usually in the pre-prod environments you don't really
care about the settings at all, and in prod you can pre-provision.

This raises an important point about how we handle defaults, which I

don't

think

we should be able to do it with replication factor and partition count

too.

I think the Java API should expose this and maybe even encourage it.

I don't have a super strong opinion on how this is exposed, though I kind
of prefer one of two options:
1. Keep the approach we have now with a config option to allow auto

create,

but using this option just gives you a plain vanilla topic with no custom
configs, for anything custom you need to use AdminClient "manually"
2. Just throw an exception and let you use AdminClient. This may be a bit
of a transition for people relying on the current behavior.

I kind of feel once you start adding AdminClient methods to the producer
and consumer it's not really clear where to stop--e.g. if I can create I
should be able to delete, list, etc.

-Jay

On Tue, Jun 28, 2016 at 9:26 AM, Grant Henke
<ghe...@cloudera.com><mailto:ghe...@cloudera.com>

wrote:

With the KIP-4 create topic schema voted and passed and a PR available
upstream. I wanted to discuss moving the auto topic creation from the
broker side to the client side (KAFKA-2410
<https://issues.apache.org/jira/browse/KAFKA-2410><https://issues.apache.org/jira/browse/KAFKA-2410>).

This change has many benefits

- Remove the need for failed messages until a topic is created
- Client can define the auto create parameters instead of a global
cluster setting
- Errors can be communicated back to the client more clearly

Overall auto create is not my favorite feature, since topic creation

is a

highly critical piece for Kafka, and with authorization added it

becomes

even more involved. When creating a topic a user needs:

- The access to create topics
- To set the correct partition count and replication factor for

their

use case
- To set who has access to the topic
- Knowledge of how a new topic may impact regex consumers or

mirrormaker

Often I find use cases that look like they need auto topic creation,

can

often be handled with a few pre made topics. That said, we still should
support the feature for the cases that need it (mirrormaker, streams).

The question is how we should expose auto topic creation in the

client. A

few options are:

- Add configs like the broker configs today, and let the client
automatically create the topics if enabled
- Both producer and consumer?
- Throw an error to the user and let them use a separate AdminClient
(KIP-4) api to create the topic
- Throw an error to the user and add a create api to the producer so
they can easily handle by creating a topic

I am leaning towards the last 2 options but wanted to get some others
thoughts on the matter. Especially if you have use cases that use auto
topic creation today.

Thanks,
Grant

--
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com<mailto:gr...@cloudera.com> | twitter.com/gchenke |
linkedin.com/in/granthenke

--
Tommy Becker
Senior Software Engineer

Digitalsmiths
A TiVo Company

www.digitalsmiths.com<http://www.digitalsmiths.com>
tobec...@tivo.com<mailto:tobec...@tivo.com>

________________________________

This email and any attachments may contain confidential and privileged material
for the sole use of the intended recipient. Any review, copying, or
distribution of this email (or any attachments) by others is prohibited. If you
are not the intended recipient, please contact the sender immediately and
permanently delete this email and any attachments. No employee or agent of TiVo
Inc. is authorized to conclude any binding agreement on behalf of TiVo Inc. by
email. Binding agreements with TiVo Inc. may only be made by a signed written
agreement.

Re: [DISCUSS] Client Side Auto Topic Creation

Reply via email to